Parse web scraped data - ruby-on-rails

So I have this coming from my web scrape:
pastebin.com/CMrFcBMX
What I want is all the prices and ticket descriptions. Here's what I have:
doc.xpath("//script[@type='text/javascript']/text()").each do |text|
  if text.content =~ /more_options_on_polling/
    price1 = text.to_s.scan(/\"(formatted_(?:total_price))\":\"(.+?)\"/).uniq
    description = text.to_s.scan(/\"(ticket_desc)\":\"(.+?)\"/).uniq
    price = price1 + description
    render json: price
  end
end
So this is what I have at the moment. However, I need to make some major edits.
Firstly, I need the description to ignore anything containing a plus symbol, e.g. "Later Owl + Chance For VIP Upgrade" would need to be ignored.
Secondly, I need to tidy up the JSON rendering so that the first price and fees are paired with the first description.
Once I have this rendered I should be sorted. I'll be using this in a JS file afterwards, so a format like this would be best:
Ticket{
[
Price:
Fees:
Description:
]
}
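A rough sketch of how the scanned values could be zipped into that shape (the formatted_fees field name and the '+' filter are assumptions based on the description above, not taken from the scraped data):

prices       = text.to_s.scan(/\"formatted_total_price\":\"(.+?)\"/).flatten
fees         = text.to_s.scan(/\"formatted_fees\":\"(.+?)\"/).flatten       # field name is a guess
descriptions = text.to_s.scan(/\"ticket_desc\":\"(.+?)\"/).flatten

tickets = prices.zip(fees, descriptions).map do |price, fee, desc|
  next if desc.to_s.include?("+")    # skip e.g. "Later Owl + Chance For VIP Upgrade"
  { price: price, fees: fee, description: desc }
end.compact

render json: tickets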
Once I have it like this I should be good to finish my application ^_^
Thanks
Sam

Related

Rails Testing Contents of CSV Download

I have a project set up to download information from a page as a csv. I'm trying to write a test that ensures the contents of this csv matches what is shown on the site, but I am not sure how to access the contents of the CSV. Here's my current attempt:
test "check_acquired_shares_contents" do
data = ["Label Amount Of Shares Share Price Total Price Occurred On From Shareholder Share Transaction Action Share Transaction Type",
"Initial 100.0 100.0 10000.0 2001-05-06 CREATE ESPP", "On Date 20.0 10.0 200.0 2012-05-06 Bill TRANSFER INDIVIDUAL",
"Label Amount Of Shares Share Price Total Price Occurred On To Shareholder Share Transaction Action Share Transaction Type",
"Individual_transfers 10.0 10.0 100.0 2010-05-06 Bill TRANSFER INDIVIDUAL"]
visit admin_shareholder_path(id: 1)
find('.action_item', :text => 'Acquired Shares CSV').click
#binding.pry
assert data == page.all('tr').map { |tr| tr.text }
end
The data array is what the CSV is meant to contain. However, I think I am not checking the contents of the CSV, because the test always fails.
Anyone have ideas on how to check the CSV contents? Thank you!
Reading the CSV file itself would be the best way to test the exact CSV scenario here. You can do something like the following:
require 'csv'

csv_data = CSV.read('csv_file_name.csv', encoding: 'UTF-8', headers: true,
                    converters: :all, header_converters: :symbol)
data = csv_data.map { |d| d.to_hash }
What you would get is:
[{:label=>"Initial", :amount_of_shares=>100.0, :share_price=>100.0, :total_price=>10000.0, ...}, {:label=>...}, ...]
Now you can easily write your test cases by comparing this data and the one you are expecting.
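A minimal sketch of how that comparison might look inside the test, assuming the downloaded file ends up at a known path (the path and the expected rows below are placeholders):

require 'csv'

test "acquired shares CSV matches the expected rows" do
  visit admin_shareholder_path(id: 1)
  find('.action_item', text: 'Acquired Shares CSV').click

  # Placeholder path: point this at wherever your test driver saves downloads.
  csv_rows = CSV.read(Rails.root.join('tmp', 'downloads', 'acquired_shares.csv'),
                      headers: true).map(&:to_hash)

  expected = [
    { 'Label' => 'Initial', 'Share Price' => '100.0', 'Total Price' => '10000.0' },
    # ... one hash per expected row ...
  ]
  assert_equal expected, csv_rows
end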

How to parse JSON with the Oj SAX parser, Saj

I want to parse a 10-20MB JSON file, and figure it's probably a good idea to not parse the entire JSON file at once and cause major memory usage. After looking around it seems like Oj's Saj or ScHandler APIs might be a good fit.
The only problem is that I can't really wrap my head around how to use them, and the documentation doesn't make it much clearer. I've looked at the example in Saj source code, and defined a super simple subclass of Oj::Saj like below:
class MySaj < Oj::Saj
  def hash_start(key)
    p key
  end
end
Used like this:
open(URL) do |contents|
  Oj.saj_parse(handler, contents)
end
And this leads to a lot of keys from my JSON being printed out. But I still have no idea how to actually access the values belonging to the keys I'm printing.
Can I access the hash itself somehow, or how am I supposed to do this?
SAX-style parsing is complicated. You have to maintain the state of the parsing and deal with each state change appropriately.
The hash_start and array_start callbacks notify your SAX handler that Saj has found the beginning of a hash or an array, and that the next callbacks will occur in the context of that hash or array. Note that hashes may be nested, may contain arrays or simple values, and may themselves be contained within arrays.
Here is a simple Saj handler that parses a very simple JSON object:
require 'oj'

class MySaj < ::Oj::Saj
  def initialize
    @hash_cnt = 0
    @array_cnt = 0
  end

  def hash_start(key)
    @hash_cnt += 1
    puts "Start-Hash[#{@hash_cnt}]: '#{key}'"
  end

  def hash_end(key)
    @hash_cnt -= 1
    puts "End-Hash[#{@hash_cnt}]: '#{key}'"
  end

  def array_start(key)
    @array_cnt += 1
    puts "Start-Array[#{@array_cnt}]: '#{key}'"
  end

  def array_end(key)
    @array_cnt -= 1
    puts "End-Array[#{@array_cnt}]: '#{key}'"
  end

  def add_value(value, key)
    puts "Value: [#{key}] = '#{value}'"
  end

  def error(message, line, column)
    puts "ERRRORRR: #{line}:#{column}: #{message}"
  end
end
json = '[{ "key1": "abc", "key2": 123}, { "test1": "qwerty", "pi": 3.14159 }]'
cnt = MySaj.new()
Oj.saj_parse(cnt, json)
Running this basic Saj parser over the JSON gives the following output:
Start-Array[1]: ''
Start-Hash[1]: ''
Value: [key1] = 'abc'
Value: [key2] = '123'
End-Hash[0]: ''
Start-Hash[1]: ''
Value: [test1] = 'qwerty'
Value: [pi] = '3.14159'
End-Hash[0]: ''
End-Array[0]: ''
You may notice that this output is roughly equivalent to one callback per token (omitting ',' and ':'). You essentially have to build into your callbacks the knowledge of what to do with individual JSON elements. Along those lines, you also need to build the hierarchy described by the callbacks. For example, when hash_start is called, push an empty hash on the stack; when hash_end is called, pop the hash or move back one level in the hierarchy.
For example, you could have a handler in hash_end that checks whether it is ending a top-level hash, and if so, do something with that hash. Note that you often cannot do this with arrays, as the top-level element in a very large number of JSON documents is an array, so you have to determine when the array is the top+1 level array.
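A minimal sketch of that stack-based approach, not from the original answer: the handler below rebuilds nesting with a stack and yields every hash found directly under the document root.

require 'oj'

# Sketch only: rebuilds nested hashes/arrays on a stack and hands each hash
# sitting directly under the document root to the block given at construction.
class TopLevelHashCollector < ::Oj::Saj
  def initialize(&block)
    @stack = []
    @block = block
  end

  def hash_start(key)
    push(key, {})
  end

  def array_start(key)
    push(key, [])
  end

  def add_value(value, key)
    attach(key, value)
  end

  def hash_end(key)
    finished = @stack.pop
    @block.call(finished) if @stack.size == 1   # parent is the document root
  end

  def array_end(key)
    @stack.pop
  end

  private

  def push(key, container)
    attach(key, container) unless @stack.empty?
    @stack.push(container)
  end

  def attach(key, value)
    parent = @stack.last
    return if parent.nil?                       # top-level scalar; ignored in this sketch
    parent.is_a?(Hash) ? parent[key] = value : parent << value
  end
end

handler = TopLevelHashCollector.new { |hash| p hash }
Oj.saj_parse(handler, '[{"a": 1}, {"b": [2, 3]}]')
# => {"a"=>1}
# => {"b"=>[2, 3]}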
If you like writing compiler backends, this is the JSON parsing solution for you. Personally, I've never enjoyed working in SAX, but for large documents it can be very resource-friendly and highly performant, depending on how well you write the handler. Be prepared for oodles of debugging and slightly mismatched state management, as that's par for the course with SAX-style parsing.
However, you shouldn't be too concerned with 10-20MB JSON, as that's actually not very large. I've processed 80+MB JSON with "regular" Oj (load and dump) quite a lot, and not had a problem with it. Unless you're running on a severely resource-constrained machine, the standard Oj will work well for you.
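For comparison, the non-SAX route is a one-liner (the file name below is a placeholder):

require 'oj'

# Plain, whole-document parsing; usually fine for files in the tens of megabytes.
data = Oj.load(File.read('big.json'))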
Saj is a streaming parser. In practice, that means it does not read a file's contents in their entirety and parse them whole; instead, it notifies you of parse events as it encounters them. Your thinking is solid: the larger the file, the more you benefit from parsing in that manner if you only want to pick and choose parts of it.
hash_start is one such event, fired when Oj sees the beginning of an Object (which will become a Hash in Ruby land).
Take this JSON for instance:
{
  "student-1": {
    "name": "John Doe",
    "age": 42,
    "knownAliases": ["Blabby Joe", "Stack Underflow"],
    "trainingGrades": {
      "Advanced Zumba Dancing": "A+",
      "Introduction to Twitter Arguments": "C-"
    }
  },
  "student-2": {
    "name": "Rebecca Melecca",
    "age": 26,
    "knownAliases": ["Booger Becca", "Tanktop Terror"],
    "trainingGrades": {
      "Intermediate Groin Kickery": "A+",
      "Advanced Quantum Mechanics": "A+"
    }
  }
}
And the following parser:
class StudentParser < Oj::Saj
  def hash_start(key)
    puts "hash_start(#{key.inspect})"
  end

  def hash_end(key)
    puts "hash_end(#{key.inspect})"
  end

  def array_start(key)
    puts "array_start(#{key.inspect})"
  end

  def array_end(key)
    puts "array_end(#{key.inspect})"
  end

  def add_value(value, key)
    puts "add_value(#{value.inspect}, #{key.inspect})"
  end
end
And you'll get the following sequence of events:
hash_start(nil)
hash_start("student-1")
add_value("John Doe", "name")
add_value(42, "age")
array_start("knownAliases")
add_value("Blabby Joe", nil)
add_value("Stack Underflow", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Advanced Zumba Dancing")
add_value("C-", "Introduction to Twitter Arguments")
hash_end("trainingGrades")
hash_end("student-1")
hash_start("student-2")
add_value("Rebecca Melecca", "name")
add_value(26, "age")
array_start("knownAliases")
add_value("Booger Becca", nil)
add_value("Tanktop Terror", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Intermediate Groin Kickery")
add_value("A+", "Advanced Quantum Mechanics")
hash_end("trainingGrades")
hash_end("student-2")
hash_end(nil)
When you see hash_start(nil), it means the parser has found a top-level object (that very first opening brace). Conversely, hash_end(nil) means that top-level object has been closed and its innards properly parsed (i.e. no parsing errors were found).
Parsing in this manner means you have to keep track of nesting (if that's meaningful to you), of adding keys and values at the right level, et cetera. That makes it annoying and hard, but it is worthwhile if you wish to carve out bits of a large file without committing everything to memory.
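A minimal sketch of that pick-and-choose style, assuming you only care about the "name" values in the student JSON above (the file name is a placeholder):

require 'oj'

# Sketch only: collects values for a single key of interest and ignores
# everything else, so only the matches are kept in memory.
class NameCollector < ::Oj::Saj
  attr_reader :names

  def initialize
    @names = []
  end

  def add_value(value, key)
    @names << value if key == "name"
  end

  # The other callbacks are not needed for this sketch; defined as no-ops to be explicit.
  def hash_start(key); end
  def hash_end(key); end
  def array_start(key); end
  def array_end(key); end
end

collector = NameCollector.new
Oj.saj_parse(collector, File.read("students.json"))
p collector.names   # => ["John Doe", "Rebecca Melecca"]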

Fetch Full Address based on Postal Code via Regex

I would like to grab the address highlighted in red. "Site Location:" can be easily identified via match(). However, how can I grab only the highlighted part without running into the content that follows, i.e. "You have applied...", etc.? Please note that the content that follows won't always start with "You have applied".
What I would do is the following:
Look for "Site Location:"
Grab anything after "Site Location:" until you find an empty/blank new line.
Can anyone help me achieve this in Ruby?
Note that the whole text is stored in a string variable.
regex = /\bSite Location:\s+(.*?)\n\s*\n/m

str = "Site Location: Raglan Street
Collingwood Town, Some County

You have applied..."

if md = regex.match(str)
  address = md[1].strip.gsub(/^\s+/, '')
  puts address
end
Output:
Raglan Street
Collingwood Town, Some County
Note: one thing to watch for is the possibility of different types of newlines. E.g. Microsoft may use \r\n\r\n, etc. in which case you may have to adjust regex accordingly.
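For example, a variant that tolerates Windows-style (\r\n) line endings might look like this (a sketch, not from the original answer):

# Accepts \n or \r\n line endings and strips any stray carriage returns.
regex = /\bSite Location:[ \t]+(.*?)\r?\n[ \t]*\r?\n/m
if md = regex.match(str)
  address = md[1].strip.gsub(/\r/, '').gsub(/^[ \t]+/, '')
  puts address
end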

Generate a link_to on the fly if a URL is found inside the contents of a db text field?

I have an automated report tool (corp intranet) where the admins have a few text area boxes to enter some text for different parts of the email body.
What I'd like to do is parse the contents of the text area and wrap any hyperlinks found with link tags (so when the report goes out there are links instead of text urls).
Is there a simple way to do something like this, without figuring out a way of parsing the text to add link tags around anything found that starts with 'http:', 'https:' or 'ftp:' and runs to the first space after it?
Thank You!
Ruby 1.8.7, Rails 2.3.5
Make a helper:
def make_urls(text)
  urls = %r{(?:https?|ftp|mailto)://\S+}i
  html_text = text.gsub urls, '<a href="\0">\0</a>'
  html_text
end
In the view, just call this helper and you will get the expected output, like:
irb(main):001:0> string = 'here is a link: http://google.com'
=> "here is a link: http://google.com"
irb(main):002:0> urls = %r{(?:https?|ftp|mailto)://\S+}i
=> /(?:https?|ftp|mailto):\/\/\S+/i
irb(main):003:0> html = string.gsub urls, '<a href="\0">\0</a>'
=> "here is a link: <a href=\"http://google.com\">http://google.com</a>"
There are many ways to accomplish your goal. One way would be to use Regex. If you have never heard of regex, this wikipedia entry should bring you up to speed.
For example:
content_string = "Blah ablal blabla lbal blah blaha http://www.google.com/ adsf dasd dadf dfasdf dadf sdfasdf dadf dfaksjdf kjdfasdf http://www.apple.com/ blah blah blah."
content_string.split(/\s+/).find_all { |u| u =~ /^https?:/ }
Which will return: ["http://www.google.com/", "http://www.apple.com/"]
Now, for the second half of the problem, you will use the array returned above to substitute HTML hyperlinks for the text links.
links = ["http://www.google.com/", "http://www.apple.com/"]

links.each do |l|
  content_string.gsub!(l, "<a href='#{l}'>#{l}</a>")
end
content_string will now be updated to contain HTML hyperlinks for all http/https URLs.
As I mentioned earlier, there are numerous ways to tackle this problem - to find the URLs you could also do something like:
require 'uri'
URI.extract(content_string, ['http', 'https'])
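For completeness, a sketch that combines URI.extract with the gsub replacement above (the helper name is illustrative):

require 'uri'

# Sketch: extract the URLs, then wrap each one in an anchor tag.
def linkify(text)
  URI.extract(text, ['http', 'https']).uniq.each do |url|
    text = text.gsub(url, "<a href='#{url}'>#{url}</a>")
  end
  text
end

linkify("Blah blah http://www.google.com/ and more blah")
# => "Blah blah <a href='http://www.google.com/'>http://www.google.com/</a> and more blah"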
I hope this helps you.

Truncate Markdown?

I have a Rails site, where the content is written in markdown. I wish to display a snippet of each, with a "Read more.." link.
How do I go about this? Simply truncating the raw text will not work, for example..
>> "This is an [example](http://example.com)"[0..25]
=> "This is an [example](http:"
Ideally I want to allow the author to (optionally) insert a marker to specify what to use as the "snippet", if not it would take 250 words, and append "..." - for example..
This article is an example of something or other.
This segment will be used as the snippet on the index page.
^^^^^^^^^^^^^^^
This text will be visible once clicking the "Read more.." link
The marker could be thought of like an EOF marker (which can be ignored when displaying the full document)
I am using maruku for the Markdown processing (RedCloth is very biased towards Textile, BlueCloth is extremely buggy, and I wanted a native-Ruby parser which ruled out peg-markdown and RDiscount)
Alternatively (since the Markdown is translated to HTML anyway) truncating the HTML correctly would be an option - although it would be preferable to not markdown() the entire document, just to get the first few lines.
So, the options I can think of are (in order of preference)..
Add a "truncate" option to the maruku parser, which will only parse the first x words, or till the "excerpt" marker.
Write/find a parser-agnostic Markdown truncate'r
Write/find an intelligent HTML truncating function
Regarding option 3, an intelligent HTML truncating function: the following, from http://mikeburnscoder.wordpress.com/2006/11/11/truncating-html-in-ruby/, with some modifications, will correctly truncate HTML and easily allow appending a string before the closing tags.
>> puts "<p><b>Something</p>".truncate_html(5, "...")
=> <p><b>Someth...</b></p>
The modified code:
require 'rexml/parsers/pullparser'

class String
  def truncate_html(len = 30, at_end = nil)
    p = REXML::Parsers::PullParser.new(self)
    tags = []
    new_len = len
    results = ''
    while p.has_next? && new_len > 0
      p_e = p.pull
      case p_e.event_type
      when :start_element
        tags.push p_e[0]
        results << "<#{tags.last}#{attrs_to_s(p_e[1])}>"
      when :end_element
        results << "</#{tags.pop}>"
      when :text
        results << p_e[0][0..new_len]
        new_len -= p_e[0].length
      else
        results << "<!-- #{p_e.inspect} -->"
      end
    end
    if at_end
      results << at_end
    end
    tags.reverse.each do |tag|
      results << "</#{tag}>"
    end
    results
  end

  private

  def attrs_to_s(attrs)
    if attrs.empty?
      ''
    else
      ' ' + attrs.to_a.map { |attr| %{#{attr[0]}="#{attr[1]}"} }.join(' ')
    end
  end
end
Here's a solution that works for me with Textile.
Convert it to HTML
Truncate it.
Remove any HTML tags that got cut in half with
html_string.gsub(/<[^>]*$/, "")
Then, use Hpricot to clean it up and close unclosed tags:
html_string = Hpricot( html_string ).to_s
I do this in a helper, and with caching there's no performance issue.
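Put together, that approach might look roughly like this sketch (render_html stands in for whatever converts your markup to HTML, and the helper name is an assumption):

def truncated_snippet(raw_text, length = 250)
  html_string = render_html(raw_text)            # 1. convert Textile/Markdown to HTML
  html_string = html_string[0...length]          # 2. truncate the HTML string
  html_string = html_string.gsub(/<[^>]*$/, "")  # 3. drop any tag that got cut in half
  Hpricot(html_string).to_s                      # 4. let Hpricot close unclosed tags
end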
You could use a regular expression to find a line consisting of nothing but "^" characters:
markdown_string = <<-eos
This article is an example of something or other.
This segment will be used as the snippet on the index page.
^^^^^^^^^^^^^^^
This text will be visible once clicking the "Read more.." link
eos
preview = markdown_string[0...(markdown_string =~ /^\^+$/)]
puts preview
Rather than trying to truncate the text, why not have two input boxes, one for the "opening blurb" and one for the main "guts"? That way your authors will know exactly what is being shown when, without having to rely on some sort of funky EOF marker.
I will have to agree with the "two inputs" approach; the content writer need not worry, since you can modify the background logic to combine the two inputs into one when showing the full content.
full_content = input1 + input2  # perhaps with some complementary HTML, for better formatting
Not sure if it applies to this case, but adding the solution below for the sake of completeness. You can use the strip_tags method if you are truncating Markdown-rendered content:
truncate(strip_tags(markdown(article.contents)), length: 50)
Sourced from:
http://devblog.boonecommunitynetwork.com/rails-and-markdown/
A simpler option that just works:
truncate(markdown(item.description), length: 100, escape: false)
