I have a bunch of large JSON files (> 500MB) which I would like to parse with a Ruby script (I am trying to parse them with the YAJL gem).
I have noticed that the JSON files have formatting problems: each file is composed of multiple JSON objects without a proper tree-like structure or enclosing array. Below you can see what a file looks like:
testfile.json:
{title: "Don Quixote", author: "Miguel de Cervantes", printyear: 2010}
{title: "Great Gatsby", author: "F. Scott Fitzgerald", printyear: 2014}
{title: "Ulysses", author: "James Joyce", printyear: 2010}
This is the script I use to parse the file:
require 'yajl'
json = File.new('testfile.json', 'r')
hash = Yajl::Parser.parse(json)
Here is the error message I get:
Yajl::ParseError: Found multiple JSON objects in the stream but no block or the on_parse_complete callback was assigned to handle them.
I would appreciate it if you could guide me on how to solve this issue.
The error message you got ("Found multiple JSON objects in the stream …") implies that your input contains multiple, individually valid JSON objects, so I assume your actual file looks more like this:
{"title":"Don Quixote","author":"Miguel de Cervantes","printyear":2010}
{"title":"Great Gatsby","author":"F. Scott Fitzgerald","printyear":2014}
{"title":"Ulysses","author":"James Joyce","printyear":2010}
One of YAJL's features is to:
Parse and encode multiple JSON objects to and from streams or strings continuously.
So given the above input (as a file or string), you can pass a block to parse, which will be called once for each parsed object:
require 'yajl'
io = File.open('testfile.json')
Yajl::Parser.parse(io) do |book|
puts "“#{book['title']}” by #{book['author']} (#{book['printyear']})"
end
Output:
“Don Quixote” by Miguel de Cervantes (2010)
“Great Gatsby” by F. Scott Fitzgerald (2014)
“Ulysses” by James Joyce (2010)
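If you also need the parsed objects after the loop, you can collect them inside the block; a minimal sketch (assuming the same testfile.json as above):
require 'yajl'

books = []
File.open('testfile.json') do |io|
  # The block runs once per JSON object found in the stream.
  Yajl::Parser.parse(io) { |book| books << book }
end

books.length # => 3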
Don't use JSON.parse, because the file's content is not valid JSON. Each line of this file looks like a Ruby hash, so a different parsing method can be used.
You should be able to parse each line by using: YAML.load(line).
Also, because the file is big, don't load the whole file into memory. Use File.foreach to process it line by line.
require 'yaml'
lines = []
File.foreach('testfile.json') do |line|
  lines << YAML.load(line)
end
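Note that YAML.load can instantiate arbitrary Ruby objects from tagged input, so if the files come from an untrusted source, a safer sketch would use YAML.safe_load instead:
require 'yaml'

File.foreach('testfile.json') do |line|
  record = YAML.safe_load(line) # raises Psych::DisallowedClass on unsafe types
  puts record['title']
end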
TL;DR
How do I specify the encoding for File.write, or how would one save image binary data to a file in a similar fashion?
More Details
I'm trying to download an image from a Trello card and then upload that image to S3 so it has an accessible URL. I have been able to download the image from Trello as binary (I believe it is some form of binary), but I have been having issues saving it as a .jpeg using File.write. Every time I attempt that, I get this error in my Rails console:
Encoding::UndefinedConversionError: "\xFF" from ASCII-8BIT to UTF-8
from /app/app/services/customer_order_status_notifier/card.rb:181:in `write'
And here is the code that triggers that:
def trello_pics
  @trello_pics ||=
    card.attachments.last(config_pics_number)&.map(&:url).map do |url|
      binary = Faraday.get(url, nil, {'Authorization' => "OAuth oauth_consumer_key=\"#{ENV['TRELLO_PUBLIC_KEY']}\", oauth_token=\"#{ENV['TRELLO_TOKEN']}\""}).body
      File.write(FILE_LOCATION, binary) # doesn't work
      run_me
    end
end
So I figure this must be an issue with the way that File.write converts the input into a file. Is there a way to specify encoding?
AFAIK you can't do it at the time of performing the write, but you can do it at the time of creating the File object; here is an example with UTF-8 encoding:
File.open(FILE_LOCATION, "w:UTF-8") do |f|
  f.write(....)
end
Another possibility would be to use the external_encoding option:
File.open(FILE_LOCATION, "w", external_encoding: Encoding::UTF_8)
Of course this assumes that the data which is written is a String. If you have (packed) binary data, you would use "wb" for opening the file, and syswrite instead of write to write the data to the file.
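For the image bytes from the question, a minimal sketch of that binary path (assuming binary holds the downloaded response body) could look like this:
# Binary mode performs no encoding conversion, so the \xFF byte is written as-is.
File.open(FILE_LOCATION, "wb") do |f|
  f.write(binary)
end

# File.binwrite is a one-call shorthand for the same idea.
File.binwrite(FILE_LOCATION, binary)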
UPDATE: As engineersmnky points out in a comment, the arguments for the encoding can also be passed as parameters to the write method itself, for instance:
IO::write(FILE_LOCATION, data_to_write, external_encoding: Encoding::UTF_8)
I'm trying to read JSON-LD into Dask from Minio. The pipeline works, but the strings come from Minio as binary strings.
So
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
print(f.read())
results in
b'\n{\n "@context": "http://schema.org/",\n "@type": "Dataset",\n ...
I can simply convert this with
with oss.open('gleaner/summoned/repo/file.jsonld', 'rb') as f:
print(f.read().decode("utf-8"))
and now everything is as I expect it.
However, I am working with Dask, and when reading into a bag with
dgraphs = db.read_text('s3://bucket/prefa/prefb/*.jsonld',
                       storage_options={
                           "key": key,
                           "secret": secret,
                           "client_kwargs": {"endpoint_url": "https://example.org"}
                       }).map(json.loads)
I cannot get the content coming from Minio to come through as strings rather than binary strings. I suspect I need them converted before they hit the json.loads map.
I assume I can inject the decode in here somehow as well, but I can't work out how.
Thanks
As the name implies, read_text opens the remote file in text mode, equivalent to open(..., 'rt'). The signature of read_text includes the various decoding arguments, such as encoding, which defaults to UTF-8. You should not need to do anything else, but please post a specific error if you are having trouble, ideally with example file contents.
If your data isn't delimited by lines, read_text might not be right for you, and you can do something like
@dask.delayed()
def read_a_file(fn):
    # or preferably open in text mode and json.load from the file
    with oss.open(fn, 'rb') as f:
        return json.loads(f.read().decode("utf-8"))

output = [read_a_file(f) for f in filenames]
and then you can create a bag or dataframe from this, as required.
I'm on Rails 5 (Ruby 2.4). I want to read an .xls doc and I would like to get the data into CSV format, just as it appears in the Excel file. Someone recommended I use Roo, and so I have
require 'roo'
require 'csv'

book = Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
text = sheet.to_csv
arr_of_arrs = CSV.parse(text)
However, what is getting returned is not the same as what I see in the spreadsheet. For instance, a cell in the spreadsheet has
16:45.81
and when I get the CSV data from above, what is returned is
"0.011641319444444444"
How do I parse the Excel doc and get exactly what I see? I don't care whether I use Roo to parse it or not, just as long as I can get CSV data that is a representation of what I see rather than some weird internal representation. For reference, running "file name_of_file.xls" on the file I was parsing gives this ...
Composite Document File V2 Document, Little Endian, Os: Windows, Version 5.1, Code page: 1252, Author: Dwight Schroot, Last Saved By: Dwight Schroot, Name of Creating Application: Microsoft Excel, Create Time/Date: Tue Sep 21 17:05:21 2010, Last Saved Time/Date: Wed Oct 13 16:52:14 2010, Security: 0
You need to save the custom-formatted value as text on the .xls side. If you're opening the .xls file from the internet this won't work, but if you can manipulate the file, this will fix your problem. You can do this using the function =TEXT(A2, "mm:ss.0"), where A2 is just the cell I'm using as an example.
book = ::Roo::Spreadsheet.open(file_location)
puts book.cell('B', 2)
=> '16:45.8'
If manipulating the file is not an option, you could just pass a custom converter to CSV.new() and convert the decimal time back to the format you need.
require 'roo-xls'
require 'csv'
CSV::Converters[:time_parser] = lambda do |field, info|
  case info[:header].strip
  when "time"
    begin
      # 0.011641319444444444 * 24 hours * 3600 seconds = 1005.81
      parse_time = field.to_f * 24 * 3600
      # 1005.81.divmod(60) = [16, 45.809999999999999945]
      mm, ss = parse_time.divmod(60)
      # returns "16:45.81"
      "#{mm}:#{ss.round(2)}"
    rescue
      field
    end
  else
    field
  end
end
book = ::Roo::Spreadsheet.open(file_location)
sheet = book.sheet(0)
csv = CSV.new(sheet.to_csv, headers: true, converters: [:time_parser]).map {|row| row.to_hash}
puts csv
=> {"time "=>"16:45.81"}
{"time "=>"12:46.0"}
Under the hood, the roo-xls gem uses the spreadsheet gem to parse the xls file. There was a similar issue to yours logged here, but it doesn't appear that there was any real resolution. Internally, xls stores 16:45.81 as a Number and associates some formatting with it. I believe the issue has something to do with the spreadsheet gem not correctly handling the cell format.
I did try messing around with adding an mm:ss.0 format by following this guide, but I couldn't get it to work; maybe you'll have more luck.
You can use the converters option. It would look something like this:
arr_of_arrs = CSV.parse(text, {converters: :date_time})
http://ruby-doc.org/stdlib-2.0.0/libdoc/csv/rdoc/CSV.html
Your problem seems to be with the way you're parsing (reading) the input file.
roo parses only Excel 2007-2013 (.xlsx) files. From your question, you want to parse .xls, which is a different format.
Like the documentation says, use the roo-xls gem instead.
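A minimal sketch of that setup, assuming file_location points at your .xls file:
require 'roo'
require 'roo-xls' # extends Roo with legacy .xls support

book = Roo::Spreadsheet.open(file_location) # picks the .xls handler by extension
sheet = book.sheet(0)
text = sheet.to_csv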
I'm practicing extracting data from an XML site, using Nokogiri to read and parse it. I need to analyze the data, but for now I'm just trying to get any output at all, without success.
I have the following code:
require 'nokogiri'
require 'open-uri'
doc = Nokogiri::HTML(open("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"))
doc.xpath('//PERSONA').each do |char_element|
puts char_element.text
end
I'm simply trying to read the characters off the XML page, but I'm not getting any results when I run it in the terminal. I also tried just writing a simple XPath call such as the ones below:
doc.xpath("//PERSONA")
or
doc.xpath("PLAY TITLE")
And I either get an error or it simply acts as if nothing was entered.
I have put in a simple function to test it, so I know the file is being read. Can anyone tell me what I'm doing wrong?
You're trying to read an XML file as an HTML one.
Please try this example:
doc = Nokogiri::XML(open("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"))
doc.xpath('//PERSONA').each{|ce| p ce.text }
"DUNCAN, king of Scotland."
"MALCOLM"
"DONALBAIN"
"MACBETH"
"BANQUO"
"MACDUFF"
"LENNOX"
"ROSS"
"MENTEITH"
"ANGUS"
"CAITHNESS"
"FLEANCE, son to Banquo."
"SIWARD, Earl of Northumberland, general of the English forces."
"YOUNG SIWARD, his son."
"SEYTON, an officer attending on Macbeth."
"Boy, son to Macduff. "
"An English Doctor. "
"A Scotch Doctor. "
"A Soldier."
"A Porter."
"An Old Man."
"LADY MACBETH"
"LADY MACDUFF"
"Gentlewoman attending on Lady Macbeth. "
"HECATE"
"Three Witches."
"Apparitions."
"Lords, Gentlemen, Officers, Soldiers, Murderers, Attendants, and Messengers. "
Please be sure you're using Nokogiri::XML instead of Nokogiri::HTML: the HTML parser normalizes element names to lowercase, so an uppercase XPath like //PERSONA matches nothing.
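If you prefer CSS selectors over XPath, an equivalent sketch against the same document:
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::XML(open("http://www.ibiblio.org/xml/examples/shakespeare/macbeth.xml"))
# CSS selectors are case-sensitive when querying XML, so match the tag exactly.
doc.css('PERSONA').each { |persona| puts persona.text }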
I know there are a lot of threads about this already, but none of the suggested solutions seem to work for me for some reason...
I am using:
Ruby 1.9.2
Rails 2.3.8
My users author CSV files in MS Excel and then need to upload these files to the web application. My web application and the database backend use UTF-8, and all special characters, such as the £ sign, get corrupted on upload.
I am reading in the file like this:
@file = params[:import_file][:uploaded_data]
Then I get the encoding of the file using:
source_encoding = "UTF-8"
if @file.external_encoding
  source_encoding = @file.external_encoding.name
end
For my test file the source encoding value is ASCII-8BIT.
Then I try to do:
@file.each { |line|
  print "#{line.force_encoding(source_encoding).encode!("UTF-8")}\n"
}
in order to see if all the text displays OK. However, this gives me an error like this:
"\xA3" from ASCII-8BIT to UTF-8
If I try to read the CSV with:
dataArray = CSV.read(@file, encoding: source_encoding)
No errors this time, but all special characters come out as ? characters.
Any pointers on where I might be going wrong, or is importing a CSV file authored with MS Excel just mission impossible?
Regards,
Olli