I need to extract a table of data on a collection of pages. I can already traverse the pages just fine.
How do I extract the table's data? I'm using Ruby and Nokogiri, but I would assume that this is a pretty general issue.
I underlined the desired data points in each row in the following image.
A sample of the html is: http://pastebin.com/YYFPbFLC
How would I parse this table into a hash via Nokogiri into the meaningful chunks?
The table's xpath is:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table
The table has a variable number of data rows and formatting rows. I only want to collect the rows with meaningful data, but I don't readily see a way to distinguish them via XPath, except that the second column will reliably have "keyword" in it. Each of these rows has an XPath of:
1st meaningful row is: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]
...
Last meaningful row: /html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[N]
The first meaningful column that needs to match text content on the "keyword" is:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[2]
The last column of this first row of data would be:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[6]
Each row is a record with a timestamp. This column (td) holds the time portion; the year, month and day are already in their own variables and can be combined with it to form a full timestamp:
/html/body/table/tbody/tr/td[2]/table/tbody/tr[2]/td/table/tbody/tr/td[2]/table/tbody/tr/td/table/tbody/tr/td[2]/table/tbody/tr[2]/td[5]
The first rule of XPath is: never use the autogenerated XPath from Firebug or another browser tool. It is brittle because it treats every page element as equally important and required, even the parts you don't care about. For example, if a notice went up at the top of the page and it happened to be in a table, it could throw off your parsing.
Instead, think about how a human would identify it. In this case, you want "the first table under the heading with the word 'today' in it". Here's the XPath for that:
//table[preceding-sibling::h2[contains(text(), "today")]][1]
This says take the tables that have a preceding h2 (in other words, that follow the h2), where the h2 contains the word "today". Then take the first such table.
Then you need to identify the rows you are interested in. Note that some rows are just dividers containing a single td, so you want to make sure you only parse the rows that have multiple td tags. In XPath, that is:
//tr[td[2]]
Then you just grab the content of all the columns. In the first one you can remove everything before the words "of magnitude" to get just the value. Putting it all together:
require 'nokogiri'

doc = Nokogiri::HTML.parse(html)
events = []
# First table after the "today" heading; only rows with at least two cells
doc.xpath('//table[preceding-sibling::h2[contains(text(), "today")]][1]//tr[td[2]]').each do |row|
  cols = row.search('td/text()').map(&:to_s)
  events << {
    :magnitude => cols[0].gsub(/^.*of magnitude /, ''),
    :temp_area => cols[1],
    :time_start => cols[2],
    :time_middle => cols[3],
    :time_end => cols[4]
  }
end
The output is:
[
{:magnitude=>"F1.7",
:temp_area=>"0",
:time_start=>"01:11:00",
:time_middle=>"01:24:00",
:time_end=>"01:32:00"},
{:magnitude=>"F3.1",
:temp_area=>"0",
:time_start=>"04:01:00",
:time_middle=>"04:10:00",
:time_end=>"04:26:00"},
{:magnitude=>"F3.5",
:temp_area=>"134F55",
:time_start=>"06:24:00",
:time_middle=>"06:42:00",
:time_end=>"06:53:00"},
{:magnitude=>"F1.4",
:temp_area=>"0",
:time_start=>"11:58:00",
:time_middle=>"12:06:00",
:time_end=>"12:16:00"},
{:magnitude=>"F1.0",
:temp_area=>"0",
:time_start=>"13:02:00",
:time_middle=>"13:05:00",
:time_end=>"13:09:00"},
{:magnitude=>"D53.7",
:temp_area=>"134F55",
:time_start=>"17:37:00",
:time_middle=>"18:37:00",
:time_end=>"18:56:00"}
]
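The question mentions that the year, month and day are held in separate variables. A rough sketch of combining them with one of the extracted time strings follows. The helper name `full_timestamp` and the sample date are hypothetical, and using UTC is an assumption, since the page's time zone is not stated.

```ruby
# Builds a Time from separate date parts and an "HH:MM:SS" string.
# year, month and day are assumed to be Integers obtained elsewhere.
def full_timestamp(year, month, day, clock)
  hour, min, sec = clock.split(':').map(&:to_i)
  Time.utc(year, month, day, hour, min, sec) # UTC is an assumption
end

# Sample date invented for illustration; the time comes from the output above.
ts = full_timestamp(2012, 3, 7, '01:11:00')
puts ts
```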
I am very new to ruby and I want to check for rows with the same number in a csv file.
What I am trying to do is go through the input CSV file and copy each row to the output file, adding another column called "Duplicate". While copying, I want to check whether the same phone number is already in the output file; if it is, the row should get "dupl" in the Duplicate column.
This is what I have.
require 'csv'

file = CSV.read('input_file.csv')
output_file = File.open("output2.csv", "w")
for row in file
  output_file.write(row)
  output_file.write("\n")
end
output_file.close
Example input file:
Phone
(202) 221-1323
(201) 321-0243
(202) 221-1323
(310) 343-4923
Desired output file:
Phone,Duplicate
(202) 221-1323,
(201) 321-0243,
(202) 221-1323,dupl
(310) 343-4923,
So basically you want to write the input to output and append a "dupl" on the second occurrence of a duplicate?
Your input to output seems fine. To get the "dupl" flag, simply count the occurrences of each number in the list. If there is more than one, it's a duplicate. But since you only want the flag on the second occurrence, just count how often the number appeared up to that point:
require 'csv'

lines = CSV.read('input_file.csv')
CSV.open('output2.csv', 'w') do |output_file|
  lines.each_with_index do |l, i|
    # l is an Array of fields; flag it if the same row occurred earlier
    flag = lines.take(i).count(l) >= 1 ? 'dupl' : nil
    output_file << l + [flag]
  end
end
l is the current line. take(i) returns all lines before, but not including, the current line, and count(l) applied to that counts how often the number appeared earlier. If it appeared at least once, print a "dupl".
There is probably a more efficient answer to this; this is just a quick and easy-to-understand version.
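One possible more efficient variant, as a sketch: remember the rows already seen in a Set so each row is checked once instead of re-counting all earlier lines. The method name `flag_duplicates` is made up for this example, and treating the first row as a header is an assumption based on the sample file.

```ruby
require 'csv'
require 'set'

# Returns the rows with one extra trailing column.
# The first row is assumed to be the header.
def flag_duplicates(rows)
  seen = Set.new
  rows.each_with_index.map do |row, i|
    next row + ['Duplicate'] if i.zero?
    # Set#add? returns nil when the row was already in the set
    row + [seen.add?(row) ? nil : 'dupl']
  end
end

input = "Phone\n(202) 221-1323\n(201) 321-0243\n(202) 221-1323\n(310) 343-4923\n"
rows = flag_duplicates(CSV.parse(input))
output = CSV.generate { |csv| rows.each { |r| csv << r } }
puts output
```

With real files, `CSV.read('input_file.csv')` would supply the rows and `CSV.open('output2.csv', 'w')` would receive them.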
I'm using Copy XPath from Chrome to create my queries. It works very well but not for this question.
Here is the site I scrape data from.
One query that works (the number next to "Senaste NAV-kurs" in table 1):
=IMPORTXML("http://www.di.se/di-fonder/fonddetaljer/?InstrumentId="&1085603;"//*[@id='fund-summary-wrap']/div[1]/dl[2]/dd/text()" )
But when I copy the XPath from the table titled "AVKASTNING", I get no data. Please help.
=IMPORTXML("http://www.di.se/di-fonder/fonddetaljer/?InstrumentId="&1085603;"//*[@id='ctl00_FourColumnWidthContent_ThreeColumnsContent_MainAndSecondColumnContent_fundInfo_fundPerformance_tableFund']/tbody/tr[4]/td[2]/span" )
If you are willing to try it another way, the following will get the "AVKASTNING" table.
=IMPORTHTML("http://www.di.se/di-fonder/fonddetaljer/?InstrumentId=1085603","table",7)
If you want a specific value from the table use index. The following example gets the second row second column value:
=index(IMPORTHTML("http://www.di.se/di-fonder/fonddetaljer/?InstrumentId=1085603","table",7),2,2)
I'm importing data from old spreadsheets into a database using rails.
I have one column that contains a list in each row. The lists are sometimes formatted as
first, second
and other times like this
third and fourth
So I wanted to split up this string into an array, delimiting either with a comma or with the word "and". I tried
my_string.split /\s?(\,|and)\s?/
Unfortunately, as the docs say:
If pattern contains groups, the respective matches will be returned in the array as well.
Which means that I get back an array that looks like
[
[0] "first"
[1] ", "
[2] "second"
]
Obviously only the zeroth and second elements are useful to me. What do you recommend as the neatest way of achieving what I'm trying to do?
You can instruct the regexp to not capture the group using ?:.
my_string.split(/\s?(?:\,|and)\s?/)
# => ["first", "second"]
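One caveat with that pattern is that a bare `and` also matches inside longer words such as "Sandy". A slightly stricter sketch, which goes beyond the original answer, adds word boundaries around `and`. The constant and method names here are invented for illustration.

```ruby
# Splits on a comma or on the standalone word "and".
# (?: ... ) keeps the separator out of the result; \b stops
# "and" from matching inside words such as "Sandy".
LIST_SEPARATOR = /\s*(?:,|\band\b)\s*/

def split_list(str)
  str.split(LIST_SEPARATOR)
end

p split_list('first, second')
p split_list('third and fourth')
p split_list('Sandy and Brandon')
```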
As a side note, regarding
into a database using rails.
Please note that this has nothing to do with Rails; String#split is plain Ruby.
We are using NexusDB for a small database. We have a table with a FulltextIndex defined on it.
The index is configured with the following options:
Character separator
ccPunctuationDash
ccPunctuationOther
The user enters a search text in an edit box, and then an SQL statement is constructed with the following WHERE clause (%s substituted with the Editbox.text of course):
WHERE CONTAINS(FullIdx, ''%s'')
When the user enters multiple words in the editbox this goes wrong as the two separate words should have been embedded in the WHERE clause like this:
WHERE CONTAINS(FullIdx, 'word1' and 'word2')
So I have to parse the textbox value, scan it for spaces and split the text at those points. That made me wonder whether it is possible to parse the search text for every setting of the FulltextIndex, using the actual definition of the FulltextIndex to create the correct WHERE clause.
So if ccPunctuationDash is enabled in the FulltextIndex definition, then the search text is also split on a '-'.
If you think about it, this is exactly the same process as when the index is created and all strings are tokenized ...
My question: what is the easiest way of tokenizing a search string according to the settings of a FulltextIndex?
The easiest way is... to create an empty #temporary table with a string field, with the same fulltext index settings as your real table. Set the TnxTable.Options to include dsoAddKeyAsVariantField. Load the string to tokenize into the string field, then view the table indexed by the fulltext index. Presto, you get an extra field displayed, which is the sorted tokens. You can now iterate over the table to read the tokens.
I want to make a few customizations to jQuery UI Autocomplete:
1) If there are no results found it should output "no results found" in the list.
2) Is it possible to highlight/bold the letters in the results as they are being typed? For example, if I type "ball" and I have "football" in my results, it needs to be output as "football" with "ball" in bold.
3) Is it possible for the results that appear at the top to match the beginning of the string. For example suppose I have 3 entries in my database:
Astrologer
Space Station
Star
I start typing "st" - this will bring up those 3 entries in that order. But I want "Star" to be the first result.
The MySQL query being used at the moment to generate the results is:
$query = mysql_query("SELECT id, name FROM customer WHERE name LIKE '%".$_GET['term']."%' ORDER BY name");
1) You can simply echo 'No results found' inside the script that returns the list if the number of rows from your mysql_query is 0.
2) This was possible in the original Autocomplete plugin but I can't see it anywhere in the jQuery UI documentation.
3) You may have to run two separate MySQL queries: the first one looking for LIKE '".$_GET['term']."%' and the second one as you have it, but excluding the results you've already got from the first query.