I want to parse a 10-20MB JSON file, and figure it's probably a good idea to not parse the entire JSON file at once and cause major memory usage. After looking around it seems like Oj's Saj or ScHandler APIs might be a good fit.
The only problem is that I can't really wrap my head around how to use them, and the documentation doesn't make it much clearer. I've looked at the example in Saj source code, and defined a super simple subclass of Oj::Saj like below:
class MySaj < Oj::Saj
def hash_start(key)
p key
end
end
Used like this:
open(URL) do |contents|
Oj.saj_parse(handler, contents)
end
And this leads to a lot of keys from my JSON being printed out. But I still have no idea how to actually access the values belonging to the keys I'm printing.
Can I access the hash itself somehow, or how am I supposed to do this?
SAX-style parsing is complicated. You have to maintain the state of the parsing, and deal with each state change appropriately.
The hash_start and array_start callbacks, notify your SAX handler that Saj has found the beginning of a hash, and that the next callbacks that occur will be in the context of that hash. Note that hashes may be nested, contain (or be contained within) arrays, or simple values.
Here is a simple Saj handler that parses a very simple JSON object:
require 'oj'
class MySaj < ::Oj::Saj
def initialize()
#hash_cnt = 0
#array_cnt = 0
end
def hash_start(key)
#hash_cnt += 1
puts "Start-Hash[#hash_cnt]: '#{key}'"
end
def hash_end(key)
#hash_cnt -= 1
puts "End-Hash[#hash_cnt]: '#{key}'"
end
def array_start(key)
#array_cnt += 1
puts "Start-Array[#array_cnt]: '#{key}'"
end
def array_end(key)
#array_cnt -= 1
puts "End-Array[#array_cnt]: '#{key}'"
end
def add_value(value, key);
puts "Value: [#{key}] = '#{value}'"
end
def error(message, line, column)
puts "ERRRORRR: #{line}:#{column}: #{message}"
end
end
json = '[{ "key1": "abc", "key2": 123}, { "test1": "qwerty", "pi": 3.14159 }]'
cnt = MySaj.new()
Oj.saj_parse(cnt, json)
The results of this basic JSON parsing with Saj gives this result:
Start-Array[#array_cnt]: ''
Start-Hash[#hash_cnt]: ''
Value: [key1] = 'abc'
Value: [key2] = '123'
End-Hash[#hash_cnt]: ''
Start-Hash[#hash_cnt]: ''
Value: [test1] = 'qwerty'
Value: [pi] = '3.14159'
End-Hash[#hash_cnt]: ''
End-Array[#array_cnt]: ''
You may notice that this output is roughly equivalent to one callback per token (omitting ',' and ':'). You essentially have to build into your callbacks the knowledge of what to do with individual JSON elements. Along those lines, you also need to build the hierarchy described by the callbacks. For example, when hash_start is called, push an empty hash on the stack; when hash_end is called, pop the hash or move back one level in the hierarchy.
For example you could have a handler in hash_end that checks to see if this is ending a top-level hash, and when it is, then do something with that hash. Note that you can often not do this with arrays, as the top-level element in a very large number of JSON documents is an array, so you have to determine when the array is the top+1 level array.
If you like writing compiler backends, this is the JSON parsing solution for you. Personally, I've never enjoyed working in Sax, but for large documents, it can be very resource-friendly and highly performant, depending on how well you write the handler. Be prepared for oodles of debugging and slightly mismatched state management, as that's par for the course with Sax-style parsing.
However, you shouldn't be too concerned with 10-20MB JSON, as that's actually not very large. I've processed 80+MB JSON with "regular" Oj (load and dump) quite a lot, and not had a problem with it. Unless you're running on a severely resource-constrained machine, the standard Oj will work well for you.
Saj is a streaming parser. What that means, in practice, is that it doesn't know a file's contents in their entirety and parses them whole — it instead notifies you of parse events as it encounters them. Your thinking is solid: the larger the file, the more you benefit from parsing in that manner if you wish to pick and choose from it.
hash_start is one such event, fired when Oj sees the beginning of an Object (which will become a Hash in Ruby land).
Take this JSON for instance:
{
"student-1": {
"name": "John Doe",
"age": 42,
"knownAliases": ["Blabby Joe", "Stack Underflow"],
"trainingGrades": {
"Advanced Zumba Dancing": "A+",
"Introduction to Twitter Arguments": "C-"
}
},
"student-2": {
"name": "Rebecca Melecca",
"age": 26,
"knownAliases": ["Booger Becca", "Tanktop Terror"],
"trainingGrades": {
"Intermediate Groin Kickery": "A+",
"Advanced Quantum Mechanics": "A+"
}
}
And the following parser:
class StudentParser < Oj::Saj
def hash_start(key)
puts "hash_start(#{key.inspect})"
end
def hash_end(key)
puts "hash_end(#{key.inspect})"
end
def array_start(key)
puts "array_start(#{key.inspect})"
end
def array_end(key)
puts "array_end(#{key.inspect})"
end
def add_value(value, key)
puts "add_value(#{value.inspect}, #{key.inspect})"
end
end
And you'll get the following sequence of events:
hash_start(nil)
hash_start("student-1")
add_value("John Doe", "name")
add_value(42, "age")
array_start("knownAliases")
add_value("Blabby Joe", nil)
add_value("Stack Underflow", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Advanced Zumba Dancing")
add_value("C-", "Introduction to Twitter Arguments")
hash_end("trainingGrades")
hash_end("student-1")
hash_start("student-2")
add_value("Rebecca Melecca", "name")
add_value(26, "age")
array_start("knownAliases")
add_value("Booger Becca", nil)
add_value("Tanktop Terror", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Intermediate Groin Kickery")
add_value("A+", "Advanced Quantum Mechanics")
hash_end("trainingGrades")
hash_end("student-2")
hash_end(nil)
When you see hash_start(nil), it means the parser has found a top-level object (that very first opening brace). Conversely, hash_end(nil) means that top-level object has been closed, and its innards properly parsed (i.e. no parsing erros have been found).
Parsing in this manner means you have to keep track of nesting, if that's meaningful to you, of adding keys and values at the right value, et cetera. That makes it annoying and hard, but worthwhile if you wish to carve out bits of a large file without committing everything to memory.
Related
This is code I have using in my project.
Please suggest some optimizations (I have refactored this code a lot but I can't think of any progress further to optimize it )
def convert_uuid_to_emails(user_payload)
return unless (user_payload[:target] == 'ticket' or user_payload[:target] == 'change')
action_data = user_payload[:actions]
action_data.each do |data|
is_add_project = data[:name] == 'add_fr_project'
is_task = data[:name] == 'add_fr_task'
next unless (is_add_project or is_task)
has_reporter_uuid = is_task && Va::Action::USER_TYPES.exclude?(data[:reporter_uuid])
user_uuids = data[:user_uuids] || []
user_uuids << data[:owner_uuid] if Va::Action::USER_TYPES.exclude?(data[:owner_uuid])
user_uuids << data[:reporter_uuid] if has_reporter_uuid
users_data = current_account.authorizations.includes(:user).where(uid: user_uuids).each_with_object({}) { |a, o| o[a.uid] = {uuid: a.uid, user_id: a.user.id, user_name: a.user.name} }
if Va::Action::USER_TYPES.include? data[:owner_uuid]
data['owner_details'] = {}
else
data['owner_details'] = users_data[data[:owner_uuid]]
users_data.delete(data[:owner_uuid])
end
data['reporter_details'] = has_reporter_uuid ? users_data[data[:reporter_uuid]] : {}
data['user_details'] = users_data.values
end
end
Note that Rubocop is complaining that your code is too hard to understand, not that it won't work correctly. The method is called convert_uuid_to_emails, but it doesn't just do that:
validates payload is one of two types
filters the items in the payload by two other types
determines the presence of various user roles in the input
shove all the found user UUIDs into an array
convert the UUIDs into users by looking them up
find them again in the array to enrich the various types of user details in the payload
This comes down to a big violation of the SRP (single responsibility principle), not to mention that it is a method that might surprise the caller with its unexpected list of side effects.
Obviously, all of these steps still need to be done, just not all in the same method.
Consider breaking these steps out into separate methods that you can compose into an enrich_payload_data method that works at a higher level of abstraction, keeping the details of how each part works local to each method. I would probably create a method that takes a UUID and converts it to a user, which can be called each time you need to look up a UUID to get the user details, as this doesn't appear to be role-specific.
The booleans is_task, is_add_project, and has_reporter_uuid are just intermediate variables that clutter up the code, and you probably won't need them if you break it down into smaller methods.
I am new to Rails, and working with some JSON, and not sure how to get to the data as the examples below:
1) If i were to use JSON.parse(response)['Response']['test']['data']['123456'], i will need to parse another response for 123457, is there a better way to loop through all the objects in data?
2) base on the membershipId, identify the top level object, ie data.
"test": {
"data": {
"123456": {
"membershipId": "321321312",
"membershipType": a,
},
"123457": {
"membershipId": "321321312",
"membershipType": a,
},
}
JSON.parse(response)['Response']['test']['data'].each do |key, object|
puts key
puts object['membershipID']
...
end
To select the data record associated with a particular membership
match_membership = '321321312'
member = JSON.parse(response)['Response']['test']['data'].select |_key, object|
object['membershipID'] == match_membership
end
puts member.key
=> 123456
For 1:
Assumption:
By you saying "need to parse another response", you were doing something like below:
# bad code: because you are parsing `response` multiple times
JSON.parse(response)['Response']['test']['data']['123456']
JSON.parse(response)['Response']['test']['data']['123457']
then simply:
Solution 1:
If you are gonna be accessing 2+ level deep hash values for just maybe 2 or 3 times,
response_hash = JSON.parse(response)
response_hash['Response']['test']['data']['123456']
response_hash['Response']['test']['data']['123457']
Solution 2:
If you are gonna be accessing 2+ level deep hash values for loooooots of times,
response_hash = JSON.parse(response)
response_hash_response_test_data = response_hash['Response']['test']['data']
response_hash_response_test_data['123456']
response_hash_response_test_data['123457']
response_hash_response_test_data['123458']
response_hash_response_test_data['123459']
response_hash_response_test_data['123460']
# ...
Solution 2 is better than Solution 1 because it saves repetitive method calls for Hash#[] which is the "getter" method each time you do like ...['test'] then ['data'] then ['123456'], and so is better-off doing Solution 2 which you store the nested-level of the hash into a variable (this does not duplicate the values in-memory!). Plus it's more readable this way.
I need to output some JSON for a customer in a somewhat unusual format. My app is written with Rails 5.
Desired JSON:
{
"key": "\/Date(0000000000000)\/"
}
The timestamp value needs to have a \/ at both the start and end of the string. As far as I can tell, this seems to be a format commonly used in .NET services. I'm stuck trying to get the slashes to output correctly.
I reduced the problem to a vanilla Rails 5 application with a single controller action. All the permutations of escapes I can think of have failed so far.
def index
render json: {
a: '\/Date(0000000000000)\/',
b: "\/Date(0000000000000)\/",
c: '\\/Date(0000000000000)\\/',
d: "\\/Date(0000000000000)\\/"
}
end
Which outputs the following:
{
"a": "\\/Date(0000000000000)\\/",
"b": "/Date(0000000000000)/",
"c": "\\/Date(0000000000000)\\/",
"d": "\\/Date(0000000000000)\\/"
}
For the sake of discussion, assume that the format cannot be changed since it is controlled by a third party.
I have uploaded a test app to Github to demonstrate the problem. https://github.com/gregawoods/test_app_ignore_me
After some brainstorming with coworkers (thanks #TheZanke), we came upon a solution that works with the native Rails JSON output.
WARNING: This code overrides some core behavior in ActiveSupport. Use at your own risk, and apply judicious unit testing!
We tracked this down to the JSON encoding in ActiveSupport. All strings eventually are encoded via the ActiveSupport::JSON.encode. We needed to find a way to short circuit that logic and simply return the unencoded string.
First we extended the EscapedString#to_json method found here.
module EscapedStringExtension
def to_json(*)
if starts_with?('noencode:')
"\"#{self}\"".gsub('noencode:', '')
else
super
end
end
end
module ActiveSupport::JSON::Encoding
class JSONGemEncoder
class EscapedString
prepend EscapedStringExtension
end
end
end
Then in the controller we add a noencode: flag to the json hash. This tells our version of to_json not to do any additional encoding.
def index
render json: {
a: '\/Date(0000000000000)\/',
b: 'noencode:\/Date(0000000000000)\/',
}
end
The rendered output shows that b gives us what we want, while a preserves the standard behavior.
$ curl http://localhost:3000/sales/index.json
{"a":"\\/Date(0000000000000)\\/","b":"\/Date(0000000000000)\/"}
Meditate on this:
Ruby treats forward-slashes the same in double-quoted and single-quoted strings.
"/" # => "/"
'/' # => "/"
In a double-quoted string "\/" means \ is escaping the following character. Because / doesn't have an escaped equivalent it results in a single forward-slash:
"\/" # => "/"
In a single-quoted string in all cases but one it means there's a back-slash followed by the literal value of the character. That single case is when you want to represent a backslash itself:
'\/' # => "\\/"
"\\/" # => "\\/"
'\\/' # => "\\/"
Learning this is one of the most confusing parts about dealing with strings in languages, and this isn't restricted to Ruby, it's something from the early days of programming.
Knowing the above:
require 'json'
puts JSON[{ "key": "\/value\/" }]
puts JSON[{ "key": '/value/' }]
puts JSON[{ "key": '\/value\/' }]
# >> {"key":"/value/"}
# >> {"key":"/value/"}
# >> {"key":"\\/value\\/"}
you should be able to make more sense of what you're seeing in your results and in the JSON output above.
I think the rules for this were originally created for C, so "Escape sequences in C" might help.
Hi I think this is the simplest way
.gsub("/",'//').gsub('\/','')
for input {:key=>"\\/Date(0000000000000)\\/"} (printed)
first gsub will do{"key":"\\//Date(0000000000000)\\//"}
second will get you
{"key":"\/Date(0000000000000)\/"}
as you needed
When a user uses my application, at one point they will get an array of arrays, that looks like this:
results = [["value",25], ["value2",30]...]
The sub arrays could be larger, and will be in a similar format. I want to allow my users to write their own custom transform function that will take an array of arrays, and return either an array of arrays, a string, or a number. A function should look like this:
def user_transform_function(array_of_arrays)
# eval users code, only let them touch the array of arrays
end
Is there a safe way to sandbox this function and eval so a user could not try and execute malicious code? For example, no web callouts, not database callouts, and so on.
First, if you will use eval, it will never be safe. You can at least have a look in the direction of taint method.
What I would recommend is creating your own DSL for that. There is a great framework in Ruby: http://treetop.rubyforge.org/index.html. Of course, it will require some effort from your side, but from the user prospective I think it could be even better.
WARNING: I can not guarantee that this is truly safe!
You might be able to run it as a separate process and use ruby $SAFE, however this does not guarantee that what you get is safe, but it makes it harder to mess things up.
What you then would do is something like this:
script = "arr.map{|e| e+2}" #from the user.
require "json"
array = [1, 2, 3, 4]
begin
results = IO.popen("ruby -e 'require \"json\"; $SAFE=3; arr = JSON.parse(ARGV[0]); puts (#{script}).to_json' #{array.to_json}") do |io|
io.read
end
rescue Exception => e
puts "Ohh, good Sir/Mam, your script caused an error."
end
if results.include?("Insecure operation")
puts "Ohh, good Sir/Mam, you cannot do such a thing"
else
begin
a = JSON.parse(results)
results = a
rescue Exception => e
puts "Ohh, good Sir/Mam, something is wrong with the results."
puts results
end
end
conquer_the_world(results) if results.is_a?(Array)
do_not_conquer_the_world(results) unless results.is_a?(Array)
OR
You could do this, it appears:
def evaluate_user_script(script)
Thread.start {
$SAFE = 4
eval(script)
}
end
But again: I do not know how to get the data out of there.
I want to break down a JSON string into smaller objects. I have two servers, one acting as the web-app interface to the whole application and the other is a repository/database.
I'm able to retrieve information from the repository to the web-app as JSON, but after that I don't know how to return it.
Here's a sample of the JSON being returned:
{"respPages":[{"page":{"page_url":"http://www.google.com/","created_at":"2011-08-10T11:00:19Z","website_id":1,"updated_at":"2011-08-10T11:00:19Z","id":1}},{"page":{"page_url":"http://www.blank.com/services/content_services/","created_at":"2011-08-10T11:02:46Z","website_id":1,"updated_at":"2011-08-10T11:02:46Z","id":2}}],"respSite":{"website":{"created_at":"2011-08-10T11:00:19Z","website_id":null,"updated_at":"2011-08-10T11:00:19Z","website_url":null,"id":1}},"respElementTypes":[{"element_type":{"created_at":"2011-08-10T11:00:19Z","updated_at":"2011-08-10T11:00:19Z","id":1,"tag_name":"head"}},
There are four tags in the JSON:
page
website
elementType
elementData
I would like to create four arrays and populate them with the object that matches these tags.
I would image the code is something like this:
#Get the json from repo using net/http
uri = URI.parse("http://127.0.0.1:3007/repository/infoid/1.json")
http = Net::HTTP.new(uri.host, uri.port)
response = http.request(Net::HTTP::Get.new(uri.request_uri))
#x = response.to_hash
#pages = Array.new
#websites= Array.new
#elementDatas = Array.new
#elementTypes = Array.new
#enter code here`#For every bit of the hash, find out what it is and allocate it accordingly
#x.each_with_index do |e,index|
if e.tagName == pages #Getting real javascripty here. There must be someway to check the tag or title of the element
#pages[index]=e
end
My goal for the returned value is to have four arrays, each containing different types of objects:
#pagesArray[1]
should contain the first occurrence of a page object in the JSON string. Then do the same for the other ones.
Of course I'd need to break down the object further but once I can break down the top level and categorize them, then I can go deeper.
In the JSON there are already tag titles respPages and respWebsites which group all the objects.
How do I turn JSON back into objects in Ruby and reference them using something like the tag name?
You should be able to decode anything in JSON format using the standard JSON library:
JSON.load(...)
It will throw exceptions on malformed JSON data, so be sure to test it thoroughly and make sure it can handle all the important cases.
If you're trying to navigate the structure of the JSON itself, you probably need to write a series of recursive methods that handle each case along the way. A good pattern to start with is this:
#data.each do |key, value|
case (key)
when 'someKey'
handle_some_key(value)
when 'otherKey'
handle_other_key(value)
end
end
You can either break out the behavior into methods as in this example, or inline it if the logic is fairly straightforward.
As a note, an alternative to Array.new is simply [ ] as it is in JavaScript. For example:
#pages = [ ]
You'll see this used frequently in most Ruby examples. The alternative to Hash.new is { }.
The following works:
json = {"respPages"=>[{"page"=>{"page_url"=>"http://www.google.com", "created_at"=>"2011-08-10T11:00:19Z", "website_id"=>1, "updated_at"=>"2011-08-10T11:00:19Z", "id"=>1}}, {"page"=>{"page_url"=>"http://www.blank.com/services/content_services/", "created_at"=>"2011-08-10T11:02:46Z", "website_id"=>1, "updated_at"=>"2011-08-10T11:02:46Z", "id"=>2}}],
"respSite"=>{"website"=>{"created_at"=>"2011-08-10T11:00:19Z", "website_id"=>nil, "updated_at"=>"2011-08-10T11:00:19Z", "website_url"=>nil, "id"=>1}},
"respElementTypes"=>[{"element_type"=>{"created_at"=>"2011-08-10T11:00:19Z", "updated_at"=>"2011-08-10T11:00:19Z", "id"=>1, "tag_name"=>"head"}}]}
#respPages, #respSite, #respElementTypes = [], [], []
json.each do |key_category, group_category|
group_category.each do |hash|
if group_category.is_a? Array
eval("##{key_category}") << hash.values.first
elsif group_category.is_a? Hash
eval("##{key_category}") << hash[1]
end
end
end
there weren't any respData in your sample but you've got the idea.