Read multiple concatenated JSON objects in Ruby

I have a file that contains multiple JSON objects that are not separated by commas:
{
  "field" : "value",
  "another_field": "another_value"
} // no comma
{
  "field" : "value"
}
Each of the objects on its own is a valid JSON object.
Is there a way that I can process this file easily?
I know this is NOT valid JSON, but unfortunately the file is generated by a third-party tool. I have no option of changing what the output looks like.
I can't open a text editor and hand-insert commas or square brackets before each run, since this is an automated process (I also really don't want to write code that opens the file and manipulates it).
In .NET there's a library that has this exact feature:
https://stackoverflow.com/a/29480032/2970729
https://www.newtonsoft.com/json/help/html/P_Newtonsoft_Json_JsonReader_SupportMultipleContent.htm
Is there anything equivalent in Ruby?

As long as your file is that simple you might want to do something like this:
# content = File.read(filename)
content = <<-EOF
{
  "field" : "value",
  "another_field": "another_value"
} // no comma
{
  "field" : "value"
}
EOF

require 'json'
JSON.parse("[#{content.gsub(/\}.*?\{/m, '},{')}]")
#=> [{"field"=>"value", "another_field"=>"another_value"}, {"field"=>"value"}]

The yajl-ruby gem enables processing concatenated JSON in Ruby. The parser can read from a String or an IO. Each complete object is yielded to a block.
require 'yajl'

File.open 'file.json' do |f|
  Yajl.load f do |object|
    # do something with each object
  end
end
See the documentation for other options (buffer size, symbolized keys, etc.).
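For instance, here is a sketch of the same loop with an explicit parser instance; the :symbolize_keys option is in yajl-ruby's documented Parser API, but treat the details as an assumption and check the gem's docs for your version:

require 'yajl'

File.open 'file.json' do |f|
  # Options are passed when the parser is built; each complete
  # top-level object in the stream is yielded to the block.
  parser = Yajl::Parser.new(symbolize_keys: true)
  parser.parse(f) do |object|
    p object   # keys arrive as symbols
  end
end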

Related

How to output JSON in Rails without escaping back slashes

I need to output some JSON for a customer in a somewhat unusual format. My app is written with Rails 5.
Desired JSON:
{
  "key": "\/Date(0000000000000)\/"
}
The timestamp value needs to have a \/ at both the start and end of the string. As far as I can tell, this seems to be a format commonly used in .NET services. I'm stuck trying to get the slashes to output correctly.
I reduced the problem to a vanilla Rails 5 application with a single controller action. All the permutations of escapes I can think of have failed so far.
def index
  render json: {
    a: '\/Date(0000000000000)\/',
    b: "\/Date(0000000000000)\/",
    c: '\\/Date(0000000000000)\\/',
    d: "\\/Date(0000000000000)\\/"
  }
end
Which outputs the following:
{
  "a": "\\/Date(0000000000000)\\/",
  "b": "/Date(0000000000000)/",
  "c": "\\/Date(0000000000000)\\/",
  "d": "\\/Date(0000000000000)\\/"
}
For the sake of discussion, assume that the format cannot be changed since it is controlled by a third party.
I have uploaded a test app to Github to demonstrate the problem. https://github.com/gregawoods/test_app_ignore_me
After some brainstorming with coworkers (thanks @TheZanke), we came upon a solution that works with the native Rails JSON output.
WARNING: This code overrides some core behavior in ActiveSupport. Use at your own risk, and apply judicious unit testing!
We tracked this down to the JSON encoding in ActiveSupport. All strings are eventually encoded via ActiveSupport::JSON.encode. We needed a way to short-circuit that logic and simply return the unencoded string.
First we extended the EscapedString#to_json method found here.
module EscapedStringExtension
  def to_json(*)
    if starts_with?('noencode:')
      "\"#{self}\"".gsub('noencode:', '')
    else
      super
    end
  end
end

module ActiveSupport::JSON::Encoding
  class JSONGemEncoder
    class EscapedString
      prepend EscapedStringExtension
    end
  end
end
Then in the controller we add a noencode: flag to the json hash. This tells our version of to_json not to do any additional encoding.
def index
  render json: {
    a: '\/Date(0000000000000)\/',
    b: 'noencode:\/Date(0000000000000)\/',
  }
end
The rendered output shows that b gives us what we want, while a preserves the standard behavior.
$ curl http://localhost:3000/sales/index.json
{"a":"\\/Date(0000000000000)\\/","b":"\/Date(0000000000000)\/"}
Meditate on this:
Ruby treats forward-slashes the same in double-quoted and single-quoted strings.
"/" # => "/"
'/' # => "/"
In a double-quoted string, "\/" means \ is escaping the following character. Because / doesn't have an escaped equivalent, the result is a single forward-slash:
"\/" # => "/"
In a single-quoted string, in all cases but one, a backslash means there's a literal backslash followed by the literal value of the next character. The single exception is when you want to represent a backslash itself:
'\/' # => "\\/"
"\\/" # => "\\/"
'\\/' # => "\\/"
This is one of the most confusing parts of dealing with strings, and it isn't restricted to Ruby; it comes from the early days of programming.
Knowing the above:
require 'json'
puts JSON[{ "key": "\/value\/" }]
puts JSON[{ "key": '/value/' }]
puts JSON[{ "key": '\/value\/' }]
# >> {"key":"/value/"}
# >> {"key":"/value/"}
# >> {"key":"\\/value\\/"}
you should be able to make more sense of what you're seeing in your results and in the JSON output above.
I think the rules for this were originally created for C, so "Escape sequences in C" might help.
I think this is the simplest way:

.gsub("/", '//').gsub('\/', '')

Applied to the rendered JSON {"key":"\\/Date(0000000000000)\\/"}, the first gsub produces {"key":"\\//Date(0000000000000)\\//"} and the second leaves you with
{"key":"\/Date(0000000000000)\/"}
as you needed.
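A simpler variant of the same idea, sketched here as an untested assumption (it presumes every forward slash in the payload should be escaped): since \/ is a legal JSON escape for /, render the JSON first and then escape the slashes in one pass:

def index
  payload = { key: '/Date(0000000000000)/' }.to_json
  # '\\/' is the two-character sequence \/ ; the block form of gsub
  # sidesteps backslash special-casing in replacement strings.
  render json: payload.gsub('/') { '\\/' }
end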

How to parse JSON with the Oj SAX parser, Saj

I want to parse a 10-20MB JSON file, and figure it's probably a good idea to not parse the entire JSON file at once and cause major memory usage. After looking around it seems like Oj's Saj or ScHandler APIs might be a good fit.
The only problem is that I can't really wrap my head around how to use them, and the documentation doesn't make it much clearer. I've looked at the example in Saj source code, and defined a super simple subclass of Oj::Saj like below:
class MySaj < Oj::Saj
  def hash_start(key)
    p key
  end
end
Used like this:
open(URL) do |contents|
  Oj.saj_parse(MySaj.new, contents)
end
And this leads to a lot of keys from my JSON being printed out. But I still have no idea how to actually access the values belonging to the keys I'm printing.
Can I access the hash itself somehow, or how am I supposed to do this?
SAX-style parsing is complicated. You have to maintain the state of the parsing, and deal with each state change appropriately.
The hash_start and array_start callbacks notify your SAX handler that Saj has found the beginning of a hash or array, and that the callbacks that follow will occur in the context of that container. Note that hashes may be nested, and may contain (or be contained within) arrays or simple values.
Here is a simple Saj handler that parses a very simple JSON object:
require 'oj'

class MySaj < ::Oj::Saj
  def initialize
    @hash_cnt = 0
    @array_cnt = 0
  end

  def hash_start(key)
    @hash_cnt += 1
    puts "Start-Hash[#{@hash_cnt}]: '#{key}'"
  end

  def hash_end(key)
    @hash_cnt -= 1
    puts "End-Hash[#{@hash_cnt}]: '#{key}'"
  end

  def array_start(key)
    @array_cnt += 1
    puts "Start-Array[#{@array_cnt}]: '#{key}'"
  end

  def array_end(key)
    @array_cnt -= 1
    puts "End-Array[#{@array_cnt}]: '#{key}'"
  end

  def add_value(value, key)
    puts "Value: [#{key}] = '#{value}'"
  end

  def error(message, line, column)
    puts "ERROR: #{line}:#{column}: #{message}"
  end
end

json = '[{ "key1": "abc", "key2": 123}, { "test1": "qwerty", "pi": 3.14159 }]'

cnt = MySaj.new
Oj.saj_parse(cnt, json)
Running this basic Saj parser gives this result:
Start-Array[1]: ''
Start-Hash[1]: ''
Value: [key1] = 'abc'
Value: [key2] = '123'
End-Hash[0]: ''
Start-Hash[1]: ''
Value: [test1] = 'qwerty'
Value: [pi] = '3.14159'
End-Hash[0]: ''
End-Array[0]: ''
You may notice that this output is roughly equivalent to one callback per token (omitting ',' and ':'). You essentially have to build into your callbacks the knowledge of what to do with individual JSON elements. Along those lines, you also need to build the hierarchy described by the callbacks. For example, when hash_start is called, push an empty hash on the stack; when hash_end is called, pop the hash or move back one level in the hierarchy.
For example, you could have a handler in hash_end that checks whether this is ending a top-level hash, and when it is, do something with that hash. Note that you often can't do this with arrays, as the top-level element in a very large number of JSON documents is an array, so you have to determine when the array is the (top+1)-level array.
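Here is a minimal sketch of that stack idea; the CollectingSaj name and the block-callback design are my own, not part of Oj. It rebuilds each container from the callbacks and hands every finished top-level object to a block:

require 'oj'

class CollectingSaj < Oj::Saj
  def initialize(&block)
    @stack = []
    @block = block
  end

  def hash_start(key)
    push(key, {})
  end

  def hash_end(key)
    finish
  end

  def array_start(key)
    push(key, [])
  end

  def array_end(key)
    finish
  end

  def add_value(value, key)
    attach(key, value)
  end

  private

  # Attach a node to the current container: keyed for hashes,
  # appended for arrays, ignored at the top level.
  def attach(key, node)
    parent = @stack.last
    case parent
    when Hash  then parent[key] = node
    when Array then parent << node
    end
  end

  def push(key, node)
    attach(key, node)
    @stack.push(node)
  end

  # Pop the finished container; if it was the top level, yield it.
  def finish
    node = @stack.pop
    @block.call(node) if @stack.empty?
  end
end

Oj.saj_parse(CollectingSaj.new { |obj| p obj }, '{"a": 1, "b": [2, 3]}')
#=> {"a"=>1, "b"=>[2, 3]}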
If you like writing compiler backends, this is the JSON parsing solution for you. Personally, I've never enjoyed working with SAX, but for large documents it can be very resource-friendly and highly performant, depending on how well you write the handler. Be prepared for oodles of debugging and slightly mismatched state management, as that's par for the course with SAX-style parsing.
However, you shouldn't be too concerned about 10-20MB of JSON, as that's actually not very large. I've processed 80+MB JSON with "regular" Oj (load and dump) quite a lot and not had a problem with it. Unless you're running on a severely resource-constrained machine, the standard Oj will work well for you.
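For the simple case, a one-shot parse is all it takes; a sketch (the filename is made up):

require 'oj'

# Load the whole document at once; fine for files in the
# tens of megabytes on a reasonably provisioned machine.
data = Oj.load(File.read('students.json'))
puts data.keys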
Saj is a streaming parser. In practice, that means it doesn't read a file's contents in their entirety and parse them whole; instead, it notifies you of parse events as it encounters them. Your thinking is solid: the larger the file, the more you benefit from parsing in that manner if you wish to pick and choose from it.
hash_start is one such event, fired when Oj sees the beginning of an Object (which will become a Hash in Ruby land).
Take this JSON for instance:
{
  "student-1": {
    "name": "John Doe",
    "age": 42,
    "knownAliases": ["Blabby Joe", "Stack Underflow"],
    "trainingGrades": {
      "Advanced Zumba Dancing": "A+",
      "Introduction to Twitter Arguments": "C-"
    }
  },
  "student-2": {
    "name": "Rebecca Melecca",
    "age": 26,
    "knownAliases": ["Booger Becca", "Tanktop Terror"],
    "trainingGrades": {
      "Intermediate Groin Kickery": "A+",
      "Advanced Quantum Mechanics": "A+"
    }
  }
}
And the following parser:
class StudentParser < Oj::Saj
  def hash_start(key)
    puts "hash_start(#{key.inspect})"
  end

  def hash_end(key)
    puts "hash_end(#{key.inspect})"
  end

  def array_start(key)
    puts "array_start(#{key.inspect})"
  end

  def array_end(key)
    puts "array_end(#{key.inspect})"
  end

  def add_value(value, key)
    puts "add_value(#{value.inspect}, #{key.inspect})"
  end
end
Run it with Oj.saj_parse(StudentParser.new, json) and you'll get the following sequence of events:
hash_start(nil)
hash_start("student-1")
add_value("John Doe", "name")
add_value(42, "age")
array_start("knownAliases")
add_value("Blabby Joe", nil)
add_value("Stack Underflow", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Advanced Zumba Dancing")
add_value("C-", "Introduction to Twitter Arguments")
hash_end("trainingGrades")
hash_end("student-1")
hash_start("student-2")
add_value("Rebecca Melecca", "name")
add_value(26, "age")
array_start("knownAliases")
add_value("Booger Becca", nil)
add_value("Tanktop Terror", nil)
array_end("knownAliases")
hash_start("trainingGrades")
add_value("A+", "Intermediate Groin Kickery")
add_value("A+", "Advanced Quantum Mechanics")
hash_end("trainingGrades")
hash_end("student-2")
hash_end(nil)
When you see hash_start(nil), it means the parser has found a top-level object (that very first opening brace). Conversely, hash_end(nil) means that top-level object has been closed and its innards properly parsed (i.e. no parsing errors have been found).
Parsing in this manner means you have to keep track of nesting (if that's meaningful to you), of adding keys and values at the right level, et cetera. That makes it annoying and hard, but it's worthwhile if you wish to carve out bits of a large file without committing all of it to memory.

Is it possible to parse dynamic xml-structured log contents with Grok?

Is it feasible using Grok to parse dynamic xml-structured log contents, such as:
<tag_1> contents </tag_1> ... <tag_N> contents </tag_N>
where "tag_*" would be the field name and "contents" - the actual contents.
Therefore the parsed message would look like:
{
  "tag_1": [
    [
      "contents"
    ]
  ],
  ...
  "tag_N": [
    [
      "contents"
    ]
  ]
}
Not with grok. You will need to resort to Ruby code to parse the XML and toss it into the event structure.
If your XML is super regular (i.e. it has a root element and only one level under it), you could maybe use code like this:
filter {
  ruby {
    code => "
      msg = event['message'].split('><');
      for part in msg
        endpos = part.index('</')
        startpos = part.index('>')
        if !endpos.nil? && !startpos.nil? then
          tag = part[0, startpos];
          text = part[startpos + 1, endpos - startpos - 1];
          event[tag] = text
        end
      end
    "
  }
}
If your XML is more complex, you are going to have to resort to a real XML parser and figure out how to use it with Logstash (I've never brought an external library into Logstash).
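For example, here is a sketch with Nokogiri, outside Logstash, that handles the same flat one-level case without hand-rolled string slicing (the sample fragment is made up):

require 'nokogiri'

message = '<tag_1> contents </tag_1><tag_2> more contents </tag_2>'

# Wrap the fragment in a synthetic root so it parses as one document,
# then map each top-level element's name to its text.
doc = Nokogiri::XML("<root>#{message}</root>")
fields = doc.root.element_children.each_with_object({}) do |node, h|
  h[node.name] = node.text.strip
end
fields #=> {"tag_1"=>"contents", "tag_2"=>"more contents"}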

Preprocessing Scala parser Reader input

I have a file containing a text representation of an object. I have written a combinator parser grammar that parses the text and returns the object. In the text, "#" is a comment delimiter: everything from that character to the end of the line is ignored. Blank lines are also ignored. I want to process text one line at a time, so that I can handle very large files.
I don't want to clutter up my parser grammar with generic comment and blank-line logic. I'd like to remove these as a preprocessing step. Converting the file to an iterator over lines, I can do something like this:
Source.fromFile("file.txt").getLines.map(_.replaceAll("#.*", "").trim).filter(!_.isEmpty)
How can I pass the output of an expression like that into a combinator parser? I can't figure out how to create a Reader object out of a filtered expression like this. The Java FileReader interface doesn't work that way.
Is there a way to do this, or should I put my comment and blank line logic in the parser grammar? If the latter, is there some util.parsing package that already does this for me?
The simplest way to do this is to use the fromLines method on PagedSeq:
import scala.collection.immutable.PagedSeq
import scala.io.Source
import scala.util.parsing.input.PagedSeqReader

val lines = Source.fromFile("file.txt").getLines.map(
  _.replaceAll("#.*", "").trim
).filterNot(_.isEmpty)

val reader = new PagedSeqReader(PagedSeq.fromLines(lines))
And now you've got a scala.util.parsing.input.Reader that you can plug into your parser. This is essentially what happens when you parse a java.io.Reader, anyway—it immediately gets wrapped in a PagedSeqReader.
Not the prettiest code you'll ever write, but you could go through a new Source as follows:
val SEP = System.getProperty("line.separator")

def lineMap(fileName: String, trans: String => String): Source = {
  Source.fromIterable(
    Source.fromFile(fileName).getLines.flatMap(
      line => trans(line) + SEP
    ).toIterable
  )
}
Explanation: flatMap will produce an iterator on characters, which you can turn into an Iterable, which you can use to build a new Source. You need the extra SEP because getLines removes it by default (using \n may not work as Source will not properly separate the lines).
If you want to apply filtering too, i.e. remove some of the lines, you could for instance try:
// whenever `trans` returns `None`, the line is dropped.
def lineMapFilter(fileName: String, trans: String => Option[String]): Source = {
  Source.fromIterable(
    Source.fromFile(fileName).getLines.flatMap(
      line => trans(line).map(_ + SEP).getOrElse("")
    ).toIterable
  )
}
As an example:
lineMapFilter("in.txt", line => if(line.isEmpty) None else Some(line.reverse))
...will remove empty lines and reverse non-empty ones.

Extracting JSON objects from JSON string

I want to break down a JSON string into smaller objects. I have two servers, one acting as the web-app interface to the whole application and the other acting as a repository/database.
I'm able to retrieve information from the repository to the web-app as JSON, but after that I don't know how to return it.
Here's a sample of the JSON being returned:
{"respPages":[{"page":{"page_url":"http://www.google.com/","created_at":"2011-08-10T11:00:19Z","website_id":1,"updated_at":"2011-08-10T11:00:19Z","id":1}},{"page":{"page_url":"http://www.blank.com/services/content_services/","created_at":"2011-08-10T11:02:46Z","website_id":1,"updated_at":"2011-08-10T11:02:46Z","id":2}}],"respSite":{"website":{"created_at":"2011-08-10T11:00:19Z","website_id":null,"updated_at":"2011-08-10T11:00:19Z","website_url":null,"id":1}},"respElementTypes":[{"element_type":{"created_at":"2011-08-10T11:00:19Z","updated_at":"2011-08-10T11:00:19Z","id":1,"tag_name":"head"}},
There are four tags in the JSON:
page
website
elementType
elementData
I would like to create four arrays and populate them with the objects that match these tags.
I imagine the code is something like this:
# Get the JSON from the repository using net/http
uri = URI.parse("http://127.0.0.1:3007/repository/infoid/1.json")
http = Net::HTTP.new(uri.host, uri.port)
response = http.request(Net::HTTP::Get.new(uri.request_uri))
@x = response.to_hash

@pages = Array.new
@websites = Array.new
@elementDatas = Array.new
@elementTypes = Array.new

# For every bit of the hash, find out what it is and allocate it accordingly
@x.each_with_index do |e, index|
  if e.tagName == pages # Getting real javascripty here. There must be some way to check the tag or title of the element
    @pages[index] = e
  end
end
My goal for the returned value is to have four arrays, each containing a different type of object:
@pagesArray[1]
should contain the first occurrence of a page object in the JSON string. Then do the same for the other ones.
Of course I'd need to break down the object further but once I can break down the top level and categorize them, then I can go deeper.
In the JSON there are already tag titles respPages and respWebsites which group all the objects.
How do I turn JSON back into objects in Ruby and reference them using something like the tag name?
You should be able to decode anything in JSON format using the standard JSON library:
JSON.load(...)
It will throw exceptions on malformed JSON data, so be sure to test it thoroughly and make sure it can handle all the important cases.
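For example, a minimal sketch of guarding the decode; the fallback to an empty hash is just one choice:

require 'json'

raw = '{"respPages": []}'
begin
  data = JSON.load(raw)
rescue JSON::ParserError => e
  warn "Malformed JSON: #{e.message}"
  data = {}
end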
If you're trying to navigate the structure of the JSON itself, you probably need to write a series of recursive methods that handle each case along the way. A good pattern to start with is this:
@data.each do |key, value|
  case key
  when 'someKey'
    handle_some_key(value)
  when 'otherKey'
    handle_other_key(value)
  end
end
You can either break out the behavior into methods as in this example, or inline it if the logic is fairly straightforward.
As a note, an alternative to Array.new is simply [ ], as it is in JavaScript. For example:
@pages = [ ]
You'll see this used frequently in most Ruby examples. The alternative to Hash.new is { }.
The following works:
json = {"respPages"=>[{"page"=>{"page_url"=>"http://www.google.com", "created_at"=>"2011-08-10T11:00:19Z", "website_id"=>1, "updated_at"=>"2011-08-10T11:00:19Z", "id"=>1}}, {"page"=>{"page_url"=>"http://www.blank.com/services/content_services/", "created_at"=>"2011-08-10T11:02:46Z", "website_id"=>1, "updated_at"=>"2011-08-10T11:02:46Z", "id"=>2}}],
"respSite"=>{"website"=>{"created_at"=>"2011-08-10T11:00:19Z", "website_id"=>nil, "updated_at"=>"2011-08-10T11:00:19Z", "website_url"=>nil, "id"=>1}},
"respElementTypes"=>[{"element_type"=>{"created_at"=>"2011-08-10T11:00:19Z", "updated_at"=>"2011-08-10T11:00:19Z", "id"=>1, "tag_name"=>"head"}}]}
@respPages, @respSite, @respElementTypes = [], [], []

json.each do |key_category, group_category|
  group_category.each do |hash|
    if group_category.is_a? Array
      eval("@#{key_category}") << hash.values.first
    elsif group_category.is_a? Hash
      eval("@#{key_category}") << hash[1]
    end
  end
end
There wasn't any respData in your sample, but you get the idea.
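If you'd rather avoid eval, here is a sketch of the same grouping into a plain hash of arrays (the buckets name is my own):

buckets = Hash.new { |h, k| h[k] = [] }

json.each do |category, group|
  # Arrays hold one single-pair wrapper hash per object;
  # a bare hash contributes its values directly.
  values = group.is_a?(Array) ? group.map { |h| h.values.first } : group.values
  buckets[category].concat(values)
end

buckets["respPages"].size #=> 2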
