Is there any way to parse JSON with trailing commas in Ruby?

I'm currently coding a transition from a system that used hand-crafted JSON files to one that can automatically generate the JSON files. The old system works; the new system works; what I need to do is transfer data from the old system to the new one.
The JSON files are used by an iOS app to provide functionality, and have never been read by our server software in Ruby on Rails before. To convert between the original system and the new system, I've started work on parsing the existing JSON files.
The problem is that one of my first two sample files has trailing commas in the JSON:
{ "sample data": [1, 2, 3,] }
This apparently went through just fine with the iOS app, because that file has been in use for a while. Now I need some way to parse the data in that file in my Ruby on Rails server, which (quite rightfully) throws an exception over the illegal trailing comma.
I can't just JSON.parse the file, because the parser rejects it as invalid JSON. Is there some way to parse it anyway -- either an option I can pass to JSON.parse, a gem that adds support, etc.? Or do I need to report back that we're going to have to hand-fix the broken files before the automated process can handle them?
Edit:
Based on comments and requests, it looks like some additional data is called for. The JSON files in question are stored in .zip files on S3, stored via ActiveStorage. The process I'm writing needs to download, unpack, and parse the zip files, using the 'manifest.json' file as a key to convert the archived file into a database structure, with multiple smaller files stored on S3 instead of a single zip that contains everything. A (very) long-term goal is for clients to stop downloading a unitary zip file and instead download the files individually. The first step towards that is to break the zip files up on the server, which means the server needs to read in the zip files. A more detailed sample of the data follows. (Note that the structure contains several design decisions I later came to regret; one of the original ideas was to be able to re-use files rather than pack multiple copies of the same identical file, but YAGNI bit me in the rear there.)
The following includes comments that are not legal in JSON format:
{
  "defined_key": [
    {
      "name": "Object_with_subkeys",
      "key": "filename",
      "subkeys": [
        {
          "id": "1"
        },
        {
          "id": "2"
        },
        {
          "id": "3" // references to identifier on another defined key
        }, // Note trailing comma
      ]
    }
  ],
  "another_defined_key": [
    {
      "identifier": "should have made parent a hash with id as key instead of an array",
      "data": "metadata",
      "display_name": "Names: Can be very arbitrary",
      "user text": "Wait for the right {moment}", // I actually don't expect { or } in the strings, but they're completely legal and may have been used
      "thumbnail": "filename-2.png",
      "video-1": "filename-3.mov"
    }
  ]
}

The problem is that you are trying to parse something that looks a lot like JSON but is not actually JSON as defined by the spec:
Arrays: An array structure is a pair of square bracket tokens surrounding zero or more values. The values are separated by commas.
Since you have a trailing comma, another value is expected, and most JSON parsers will raise an error over this violation.
All that being said, the json-next gem will parse this appropriately, so maybe give that a shot.
It can parse JSON-like representations that completely violate the JSON spec, depending on the flavor you use (HanSON, SON, JSONX, as defined in the gem).
Example:
json = "{ \"sample data\": [1, 2, 3,] }")
require 'json/next'
HANSON.parse(json)
#=> {"sample data"=>[1, 2, 3]}
But the following is equivalent and completely violates the spec:
JSONX.parse("{ \"sample data\": [1 2 3] }")
#=> {"sample data"=>[1, 2, 3]}
So if you choose this route, do not expect to use it to validate the JSON data or structure in any fashion, and be aware that you could end up with unintended results.
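If adding a gem is not an option, another rough approach is to strip the trailing commas yourself before handing the text to the standard parser. Here is a minimal sketch of that idea in Python (the same substitution ports directly to Ruby's gsub); note that it is naive and can corrupt data when a string literal itself contains ",]" or ",}", which matters here since the sample data explicitly allows { and } inside strings:

import json
import re

def strip_trailing_commas(text):
    # Remove any comma that is followed only by whitespace and a closing
    # bracket or brace. NOTE: naive -- it also rewrites the inside of
    # string literals, so it is unsafe if strings may contain ",]" or ",}".
    return re.sub(r',\s*([\]}])', r'\1', text)

print(json.loads(strip_trailing_commas('{ "sample data": [1, 2, 3,] }')))
#=> {'sample data': [1, 2, 3]}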


Use Annotation tool configuration / Automatic annotation service from brat

I'd like to use a personal API for named entity recognition (NER), and use brat for visualisation. It seems brat offers an Automatic annotation tool, but documentation about its configuration is sparse.
Are there any working examples of this feature?
Could someone explain to me what the format of the API's response should be?
I finally managed to understand how it works, thanks to this topic on the brat-users Google Group mailing list:
https://groups.google.com/g/brat-users/c/shX1T2hqzgI
The text is sent to the Automatic Annotator API as a byte string in the body of a POST request, and the format BRAT requires in response from this API is a dictionary of dictionaries, namely:
{
    "T1": {
        "type": "WhatEverYouWantString",  # must be defined in the annotation.conf file
        "offsets": [(0, 2), (10, 12)],    # list of (start, end) integer positions of each annotated span
        "texts": ["to", "go"]
    },
    "T2": {
        "type": "SomeString",
        "offsets": [(start1, stop1), (start2, stop2), ...],
        "texts": ["string[start1:stop1]", "string[start2:stop2]", ...]
    },
    "T3": ...
}
Then you put this dictionary in JSON format and send it back to BRAT.
Notes:
"T1", "T2", ... are mandatory keys (and correspond to the term index in the .ann file that BRAT generates during manual annotation)
the keys "type", "offsets" and "texts" are mandatory, otherwise you get errors in the BRAT log (you can consult these logs as explained in the Google Group thread linked above)
the formats of the values are strict ("type" takes a string, "offsets" takes a list of tuples (or lists) of integers, "texts" takes a list of strings), otherwise you get BRAT errors
I suppose that the strings in "texts" must correspond to the "offsets", otherwise there should be an error, or at least a problem with the display of tags (this is already the case if you generate the .ann files from an automatic detection algorithm and the start and stop positions differ from the associated text)
I hope it helps. I managed to make the API using Flask this morning, but I needed to construct a flask.Response object to get the correct output format. Also, the incoming data from BRAT could not be captured by the Flask API until I used the request.get_data() method of the flask.request object.
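To make that concrete, here is a minimal sketch of such a Flask service, assembled from the format described above. The route, port, and tagging logic are placeholders of my own; a real service would run its NER model where the comment indicates:

import json
from flask import Flask, request, Response

app = Flask(__name__)

@app.route("/tag", methods=["POST"])
def tag():
    # brat sends the document text as the raw body of the POST request
    text = request.get_data(as_text=True)

    annotations = {}
    words = text.split()
    if words:
        # Placeholder: annotate only the first word. A real NER model
        # would produce the (start, stop) offsets here instead.
        start = text.index(words[0])
        annotations["T1"] = {
            "type": "WhatEverYouWantString",  # must exist in annotation.conf
            "offsets": [[start, start + len(words[0])]],
            "texts": [words[0]],
        }

    # brat expects the dictionary serialized as JSON in the response body
    return Response(json.dumps(annotations), mimetype="application/json")

if __name__ == "__main__":
    app.run(port=47111)  # port is arbitrary; match your brat tool configuration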
Also, I have to mention that I was not able to use the examples given in the BRAT GitHub:
https://github.com/nlplab/brat/blob/master/tools/tokenservice.py
https://github.com/nlplab/brat/blob/master/tools/randomtaggerservice.py
I mean I could not make them work, but I'm not familiar at all with the API and HTTP packages in Python. At least I figured out what the correct format for the API response was.
Finally, I have no idea how to format relations among entities (i.e. BRAT arrows) from the API, though
https://github.com/nlplab/brat/blob/master/tools/restoataggerservice.py
seems to work with such things.
The Google Group discussion
https://groups.google.com/g/brat-users/c/lzmd2Nyyezw/m/CMe9FenZAAAJ
seems to mention that it is not possible to send relations between entities back from the Automatic Annotation API and have them work with BRAT.
I may try it later :-)

Having a POJO-like feature in the Karate API?

I have been using Karate and REST-assured for some time. There are advantages and downsides to both tools, of course. Right now I have a REST-assured project where I have request and response objects and POJOs. My request classes wrap my endpoints and send my POJOs to those endpoints. I do all my headers and similar configuration in an abstract layer. In case I need to override them, I override them during the test. If not, it's two lines of code for me to trigger an endpoint.
My way of working with the happy path and negative path of an endpoint is that I initialize the POJO before every test with new values in the constructor. Then I override the value that I want in test scope. For example, if I want to test a negative case for the password field, I set this field to an empty string during the test. But the other fields are already set to some random stuff before the test.
But I don't know how to achieve this with Karate.
Karate allows me to create a JSON representation of my request body and define my parameters as seen below example.
{
  "firstName": "<name>",
  "lastName": "<lastName>",
  "email": "<email>",
  "role": <role>
}
Then in every test I have to fill all the fields with some data.
|token |value|
|name |'canberk'|
|lastName |''|
|email |'canberk#blbabla.com'|
|role |'1'|
and
|token |value|
|name |''|
|lastName |'akduygu'|
|email |'canberk#blbabla.com'|
|role |'1'|
It goes on like this.
It's OK with a four-field JSON body, but when the body starts to have more than 20 fields, it becomes a pain to initialize every field for every test.
Does Karate have a predefined way of addressing this problem, or do I need to come up with a solution?
There are advantages and downsides to both tools, of course.
I'm definitely biased, but IMHO the only disadvantage of Karate compared to REST-assured is that you don't get compile-time safety :) I hope that you have seen this comparison.
Karate has multiple ways to do what you want. Here's what I would do.
create a JSON file that has all your "happy path" values set
use the read() syntax to load the file (which means this is re-usable across multiple tests)
use the set keyword to update only the field for your scenario or negative test
You can get even fancier if you use embedded expressions (see the template sketch after this list).
create a JSON file that has all your "happy path" values set, where the values you want to vary look like foo: '##(foo)'
before using read(), init some variables, e.g. * def foo = 'bar'; if you use null, that JSON key will even be removed from the JSON
read() the JSON. It is ready for use!
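For instance, a template file for the body above (call it user.json; the filename and the choice of which fields vary are hypothetical) might look like this, with the variable fields written as embedded expressions:

{
  "firstName": "canberk",
  "lastName": "##(lastName)",
  "email": "##(email)",
  "role": 1
}

A scenario would then do something like * def lastName = '' before * def body = read('user.json') to get a negative-case payload, while every other field keeps its happy-path value.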
You can refer to this file that demonstrates some of these concepts for XML, and you may get more ideas: xml.feature

What is this url string concept called?

I have just been on a website and noticed a strange query-string-like structure in the URL. It seems to contain key-value pairs, and when you make a change in the website's form, the values update in the URL.
Here is the URL:
http://www.holidaysplease.co.uk/holiday-finder/#{"d":"2016-06-1","a":[],"t":20,"r":200,"f":13,"tr":180,"s":[5,4,3],"ac":[],"c":[],"sh":[],"dh":[],"du":null,"b":"500-4407"}
Does anyone know what this concept is called? I recall seeing it once in a Java-based web application, but can someone explain how this is achieved and in what language?
It looks like this is a fragment identifier.
Wikipedia says:
The fragment identifier introduced by a hash mark # is the optional last part of a URL for a document. It is typically used to identify a portion of that document. The generic syntax is specified in RFC 3986. The hash mark separator in URIs does not belong to the fragment identifier.
RFC 3986 is defined here.
I had never seen that before; the information above is what a little bit of research gave me back. I hope it is not completely wrong.
The text after a # in a URL is a fragment identifier, which is normally used to refer to a section in the document but can contain any data. It won't be sent in the request to the server, yet it can be read by the client using JavaScript.
In your example, the fragment identifier contains a data structure encoded with JSON, which is a serialization format supporting key-value pairs and arrays.
Here's the JSON from your example in a more readable form:
{
  "d": "2016-06-1",
  "a": [],
  "t": 20,
  "r": 200,
  "f": 13,
  "tr": 180,
  "s": [5, 4, 3],
  "ac": [],
  "c": [],
  "sh": [],
  "dh": [],
  "du": null,
  "b": "500-4407"
}
The concept behind this is that the data the user entered is kept in the URL as a JSON structure. The page's JavaScript reads the string as JSON and processes the request.
This approach works well for web forms, and the JSON can be made URL-safe using the JavaScript function encodeURIComponent.
I think you noticed that when the date is changed, it just updates the JSON in the URL. So the data filled into the form is carried along in JSON format.
In your URL,
d - date (year, month, day)
du - duration
a - holiday type
t - temperature
r - rainfall
f - flight time
tr - travel
s - star ratings for the hotels
b - budget
Hope this information helps you :)
The part after the # is called a fragment identifier. Client-side JavaScript code can access the content of the fragment using location.hash. In this case the fragment contains JSON data. The browser will typically not even send the fragment to the web server, so it's only used client-side.
The most common use of the fragment identifier is to link to a specific element on a web page using its id attribute. This is used for subsections of Wikipedia articles:
https://en.wikipedia.org/wiki/Fragment_identifier#Basics
When the fragment contains JSON, you can inspect the data by opening up your browser's JavaScript console and running this code:
JSON.parse(location.hash.substr(1))
This kind of scheme can be used by single page applications to store state in the url, so that you can bookmark it and share the url.
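As a sketch of the same idea outside the browser, here is the round trip in Python (the URL is a shortened version of the one from the question; variable names are illustrative):

import json
from urllib.parse import urlsplit, unquote, quote

url = ('http://www.holidaysplease.co.uk/holiday-finder/'
       '#{"d":"2016-06-1","t":20,"s":[5,4,3]}')

# Everything after the first '#' is the fragment; browsers do not send it
# to the server, but any code holding the full URL can read it.
state = json.loads(unquote(urlsplit(url).fragment))
print(state["s"])  #=> [5, 4, 3]

# Going the other way: serialize the state and percent-encode it, as a
# browser script would with encodeURIComponent.
new_url = ('http://www.holidaysplease.co.uk/holiday-finder/#'
           + quote(json.dumps(state)))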

How to get a unique filename in Erlang Yaws

I am creating an application that uploads media files (audio, video, and images) using Erlang on the Yaws web server. Multiple users can upload files concurrently, so how can I get a unique filename every time?
Please suggest a solution.
Usually, you have two ways to do that:
1: You store a base ID in a database (like "a") which will be the name of the next uploaded file, and you increment the ID each time you add a file ("b", "c", "aa", "aA", ...). For example, this is the kind of technique used by YouTube or URL shorteners.
2: You hash the file (with MD5, SHA1, CRC32, ...) and the resulting hash becomes the name of the file on your server. Since the hash should be unique for each file, you shouldn't have name collisions.
The hash technique is usually more CPU-intensive (because you will need to hash every uploaded file), but you can then provide the hash to your clients so they can check the integrity of the file.
If you want to keep the original name of the file, you will need a database (like mnesia, if you are using Erlang) to store each relation (ID -> original name, or hash -> original name).
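As a language-agnostic sketch of the second technique (written in Python for brevity; in Erlang the same pieces would come from the crypto module and mnesia), content-hash naming might look like this:

import hashlib
import os

def store_upload(data: bytes, original_name: str, upload_dir: str = 'uploads') -> str:
    # Name the file after the SHA-256 hash of its contents: identical
    # uploads collapse to a single file, and concurrent uploads cannot
    # clash, because different contents yield different digests.
    _, ext = os.path.splitext(original_name)
    filename = hashlib.sha256(data).hexdigest() + ext

    os.makedirs(upload_dir, exist_ok=True)
    with open(os.path.join(upload_dir, filename), 'wb') as f:
        f.write(data)

    # A real system would also record hash -> original_name in a database
    # so the original filename can be recovered later.
    return filename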

XML Schema - Allow Invalid Dates

Hi, I am using BizTalk's flat file parser (with an XML schema) to parse a CSV file. The CSV file sometimes contains an invalid date: 1/1/1900. Currently the schema validation for the flat file fails because of the invalid date. Is there any setting that I can use to allow the date through?
I don't want to read the date as a string. I might be forced to if there is no other way.
You could change it to a valid XML dateTime (e.g., 1900-01-01T00:00:00Z) using a custom pipeline component (see examples here). Or you can just treat it as a string in your schema and deal with converting it later in a map, in an orchestration, or in a downstream system.
Here is a C# snippet that you could put into a scripting functoid inside a BizTalk map to convert the string to an xs:dateTime, though you'll need to do some more work if you want to handle the potential for bad input data:
public string ConvertStringDateToDateTime(string inputDate)
{
    // "s" produces the sortable ISO 8601 format, e.g. 1900-01-01T00:00:00
    return DateTime.Parse(inputDate).ToString("s", System.Globalization.DateTimeFormatInfo.InvariantInfo);
}
Also see this blog post if you're looking to do that in multiple places in a single map.
