dask - read_json into dataframe ValueError

A minimal example: I have a JSON file, xaa.json, whose contents look like this (two rows from the Stack Overflow archive):
[
{"Id": 11, "Body": "<p>Given a specific <code>DateTime</code> value", "Title": "Calculate relative time in C#", "Comments": "There is the .net package https://github.com/NickStrupat/TimeAgo which pretty much does what is being asked."},
{"Id": 7888, "Body": "<p>You need to use an <code>ifstream</code> if you just want to read (use an <code>ofstream</code> to write, or an <code>fstream</code> for both).</p>
<p>To open a file in text mode, do the following:</p>
<pre><code>ifstream in(\\"filename.ext\\", ios_base::in); // the in flag is optional
</code></pre>
<p>To open a file in binary mode, you just need to add the \\"binary\\" flag.</p>
<pre><code>ifstream in2(\\"filename2.ext\\", ios_base::in | ios_base::binary );
</code></pre>
<p>Use the <code>ifstream.read()</code> function to read a block of characters (in binary or text mode). Use the <code>getline()</code> function (it's global) to read an entire line.</p>
", "Title": null, "Comments": "+1 for noting that the global getline() function is to be used instead of the member function."}
]
I want to load such json files into a dask dataframe. I use:
so_posts_df = dd.read_json('./xaa.json', orient='columns').compute()
I get this error:
ValueError: Unexpected character found when decoding object value
After looking into the contents, I figured that the \\" sequences were causing it. When I removed them (my editor, IntelliJ, then reported clean, well-formed JSON) and ran the same read_json call, it read the file into a dataframe and displayed it correctly.
So, I have 2 questions: (a) What are the valid values for the read_json argument "errors"? (b) How can I properly preprocess the JSON file before reading it into a dask dataframe? The double quotes and the double escaping seem to be causing the issue.
[This may not be a dask issue at all...]...

This also fails with pandas.read_json. I recommend first trying to get things to work well with Pandas, and then try the same workload with dask dataframe. You will likely get much better support when asking Pandas questions.
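For question (b), one possible preprocessing step can be sketched with the standard library alone. It is a rough illustration, not a general JSON repair tool: it assumes every \\" sequence in the file was meant to be an escaped quote (\"), and that the literal line breaks inside string values should be kept as newlines. The function names are hypothetical.

```python
import json

# Assumption: every two-backslash-plus-quote sequence in the file is a
# double-escaped quote, never a string that legitimately ends in a backslash.
DOUBLE_ESCAPED_QUOTE = '\\\\"'  # the characters: \ \ "
VALID_ESCAPED_QUOTE = '\\"'     # the characters: \ "


def repair_double_escaped_quotes(raw: str) -> str:
    """Replace the double-escaped quote with the valid JSON escape."""
    return raw.replace(DOUBLE_ESCAPED_QUOTE, VALID_ESCAPED_QUOTE)


def load_records(raw: str):
    """Parse the repaired text into a list of dicts."""
    fixed = repair_double_escaped_quotes(raw)
    # strict=False lets the json module accept control characters, such as
    # literal newlines, inside string values.
    return json.loads(fixed, strict=False)


# A shortened stand-in for one row of xaa.json, with the same two problems:
# a double-escaped quote and a literal newline inside a string.
sample = '[{"Id": 7888, "Body": "<pre>in(\\\\"filename.ext\\\\");\n</pre>", "Title": null}]'

records = load_records(sample)
print(records[0]["Body"])
```

The resulting list of dicts can then be handed to pandas.DataFrame(records), and from there to dask.dataframe.from_pandas if a dask dataframe is still needed.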

Related

Puppet3 | read values from different yaml file

So I'm using Puppet 3 and I have X.yaml and Y.yaml. X.yaml has profiles::resolv_conf::nameservers: [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] in it. I want to add that [ '1.1.1.1', '8.8.8.8', '2.2.2.2' ] as the value of servers:, which is in Y.yaml:
'dns_test':
  plugin_type: 'dns_query'
  options:
    'servers': ['1.1.1.1', '8.8.8.8', "2.2.2.2"]
    'domains': ['google.com']
    'record_type': 'A'
    'timeout': 5
  tags:
    'input_source': 'dns_query'
By doing this I want to make sure that when someone changes the values in profiles::resolv_conf::nameservers:, the value is changed in this telegraf plugin too.
I tried multiple solutions, but the one that came closest was:
'dns_test':
  plugin_type: 'dns_query'
  options:
    'servers': "%{hiera('profiles::resolv_conf::nameservers')}"
    'domains': ['google.com']
    'record_type': 'A'
    'timeout': 5
  tags:
    'input_source': 'dns_query'
but the problem is that Puppet was adding extra " " around the value, and the final value in the plugin conf was:
"["1.1.1.1", "2.2.2.2", "8.8.8.8"]" instead of ["1.1.1.1", "2.2.2.2", "8.8.8.8"]
TL;DR: You can't.
From the current docs and the Puppet documentation archive, I confirm that no version of the %{hiera} interpolation function or its replacement, %{lookup}, ever supported interpolating values other than strings. That's expressed in the current docs like so:
The lookup and hiera interpolation functions look up a key and return
the resulting value. The result of the lookup must be a string; any
other result causes an error.
(Emphasis added)
What you're looking for would be supported by Hiera 5's %{alias} function, provided that the data are available somewhere else in the same hierarchy (which is also a requirement for %{hiera}). Since you're stuck on Puppet 3, however, you're probably on Hiera 2, and certainly not later than Hiera 3.
"But wait!" you may say. "I'm getting a successful interpolation, but the data are just munged." Specifically, you wrote:
problem is that puppet was adding extra " " to the value and final value
Since %{hiera()} interpolates only strings, it is not surprising that you got a string value, given that you got a value at all. I do find it a bit surprising that Puppet did not throw an error, but I'm not prepared to comment further on that without a minimal reproducible example that demonstrates the behavior.

Sending IFS File to Outq Prints Line of "#" Symbols

I am attempting to send a file from IFS to an outq on our AS/400 system. Whenever I do, I get exactly what I send, as well as a line of "#" symbols of varying lengths appended to the end.
Here's the command I'm using:
qsh cmd('cat -c /path/test.txt | Rfile -wbQ -c "ovrprtf file(qprint)
outq(*LIBL/ABCD) devtype(*USERASCII) rplunprt(*no) splfname(test) hold(*no)"
qprint')
The contents of test.txt is just Hello World!
The output I get when I send the command is
Hello World!####################################################################
I have not found any posts online about a similar problem, and have tried changing values and looking for additional switches to get it to work. Nothing I'm doing seems to fix the issue.
Is there a command or switch that I am missing, or is something I have in there already causing this?
EDIT:
I found this documentation which is the first time I've seen this issue mentioned, but it's not very helpful:
“Messages for a Take Action command might consist of a long string of "at" symbols (#) in a pop-up message. (The Reflex automation Take Action command, which is configured in situations, does not have this problem.) A resolution for this problem is under construction. This problem might be resolved by the time of the product release. If you see this problem, contact IBM Software Support.”
The only differences are: 1) this is not a pop-up message, it's printed. 2) I don't believe we use Tivoli Monitoring, although I could be wrong.
Assuming we do use Tivoli Monitoring, what would the solution be? There's no additional documentation past that, and I am not a system administrator, so I can't really make the call to IBM Software Support myself. And assuming we DON'T use it, what else could cause this issue?
I get different results, yet similar. I created a test.txt with Windows Explorer, put in Hello, world!, saved it and tried the script. I got gibberish for the 'Hello, world!' and then the line of # symbols.
My system is 7.3 TR5, CCSID 37 (US English) and my IFS file is CCSID 1252 (Windows English). Results did not change if I used a stream file of CCSID 819 (US ASCII).
I didn't have any luck modifying Rfile switches.
I found that removing devtype(*userascii) produced printed output in plain English without the # symbols. Do you really need *USERASCII? I would think that would be more for a pre-formatted 'print-ready' file like Postscript or the like.
EDIT: some more things to try
I don't understand why *USERASCII is adding those # symbols; it looks like a translation issue.
I tried the commands below and still got the extra ###... You might have to play with the TOCCSID() parameter. Although the attempt failed, it did give me an idea: what if those # symbols are EBCDIC spaces being sent as-is to the *USERASCII print stream? All we'd need is a way to send only the number of bytes in the stream file, without any padding.
CRTPF FILE(QTEMP/PRTSTMF) RCDLEN(132)
CPY OBJ('/path/test.txt') TOOBJ('/qsys.lib/qtemp.lib/prtstmf.file/prtstmf.mbr') replace(*yes)
ovrprtf file(qprint) outq(*LIBL/prt3812) devtype(*USERASCII) rplunprt(*no) splfname(test) hold(*no)
cpyf prtstmf qprint
The data in QTEMP/PRTSTMF is in ASCII; DSPPFM shows that much. It also shows a bunch of spaces: after all, it is a fixed length file. My next step was to write an RPG program to read the stream file and print it, but Scott Klement already did that: http://www.scottklement.com/PrtStmf.zip
This works on my system:
ovrprtf file(qsysprt) outq(*LIBL/abcd) devtype(*USERASCII) rplunprt(*no) splfname(test) hold(*no)
prtstmf stmf('/path/test.txt') outq(abcd)

Jmeter Parameter CSV not recognizing variable

I am confused about a JMeter variable that is not getting picked up from the CSV Data Set Config. I have a Thread Group with an HTTP Request, CSV Data Set Config, HTTP Header Manager, and Results Tree. Everything seems to work fine, but there is just one variable that is not recognized...
Here is the Request Body after running the test:
{
"W_ID": "${W_ID}",
"b": "b",
"c": "c",
"d": "d"
}
For some reason the W_ID variable is not being recognized, but the other variables are. All rows have the correct value assigned to them except for W_ID. I tried deleting the W_ID column from my file (in case there was weird formatting or white space), saving, and re-running the test, but got the same results.
Any ideas? Thanks for your help! Please let me know if I can provide more information or clarity.
Edit1:
I noticed that the object name shows up in the body of the service... could that have an impact? This is the body (inv_adj is the object name):
{
"inv_adj": {
"W_ID": "string",
"a": "string",
"b": "string",
"c": "string",
}
Edit2:
CSV variables were requested:
Row 1: W_ID, b, c, d
Row 2: a, b, c, d
In JMeter, variables are referenced as follows:
${VARIABLE}
If an undefined function or variable is referenced, JMeter does not report/log an error - the reference is returned unchanged. For example, if UNDEF is not defined as a variable, then the value of ${UNDEF} is ${UNDEF}.
So, double-check how you have defined the variable names in your CSV Data Set Config. Is it WarehouseID or W_ID there? If you use WarehouseID in your CSV Data Set Config, then you should write {"W_ID": "${WarehouseID}"} in your HTTP Sampler's body.
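The "returned unchanged" behaviour can be mimicked with a toy model. The following Python sketch is not JMeter's implementation; the substitute function and the sample values are made up for illustration. It only shows why a mismatch between the configured variable name and the name used in the body leaves ${W_ID} visible in the request.

```python
import re


def substitute(template: str, variables: dict) -> str:
    """Replace ${NAME} with its value; leave undefined references untouched."""
    def replace_match(match):
        name = match.group(1)
        # An undefined name falls back to the original "${NAME}" text.
        return str(variables.get(name, match.group(0)))

    return re.sub(r"\$\{([^}]+)\}", replace_match, template)


body = '{"W_ID": "${W_ID}", "b": "${b}"}'

# Hypothetical case: the config defines WarehouseID, but the body asks for W_ID.
mismatched = substitute(body, {"WarehouseID": "42", "b": "x"})
print(mismatched)  # {"W_ID": "${W_ID}", "b": "x"}

# Once the names agree, the reference is resolved.
matched = substitute(body, {"W_ID": "42", "b": "x"})
print(matched)     # {"W_ID": "42", "b": "x"}
```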
Edit:
Here is an example step by step [screenshots: CSV Data Set; CSV Data Set Config; Request Body Before Test; Request Body After Test in Results Tree].
I tried to reproduce your issue locally on my JMeter instance, but I could not reproduce the error that you are facing. Without your entire data file and the JMeter test plan, it is difficult to understand the issue. With my own test plan and sampler configuration [screenshots], I can see on replay that the values are substituted properly.

Ruby - extract info from JSON with variable loop iteration

I have a JSON response which is stored as a string in "BQresponse"
{"kind":"bigquery#queryResponse", "schema":{"fields":[{"name":"Revenue", "type":"INTEGER", "mode":"NULLABLE"}, {"name":"Country", "type":"STRING", "mode":"NULLABLE"}]}, "jobReference":{"projectId":"curious-idea-532", "jobId":"job_S5rTcY2vwEu-amtrxb8NRPWiynU"}, "totalRows":"3", "rows":[{"f":[{"v":"100"}, {"v":"Ireland"}]}, {"f":[{"v":"200"}, {"v":"Netherlands"}]}, {"f":[{"v":"50"}, {"v":"Singapore"}]}], "totalBytesProcessed":"0", "jobComplete":true, "cacheHit":true}
I am trying to convert this into a two line response (for later export to CSV), looking exactly like this:
Country||Sum of Revenue|,Ireland,Netherlands,Singapore
Revenue,100,200,50
So far, I've extracted the first parts, like so:
puts BQresponse[/#{D1_mark1}(.*?)#{D1_mark2}/m, 1]+"||"+BQresponse[/#{M1_mark1}(.*?)#{M1_mark2}/m, 1]
Next I need to extract "Ireland,Netherlands,Singapore". However, I cannot use the same approach as above, because the number of values may change as the string is updated (maybe only 2 countries, or 5).
The string included a part that says "totalRows":"3"," - this 3 is the number of expected countries and I suppose could be used in a loop/for-each of some sort. But I'm not sure how to best approach this.
The number values on the second line face the exact same issue (each country has a number). The "Revenue" on the second line is simply a repeat of "Revenue" on the first line, with "Sum_of_" removed.
Appreciate suggestions on what direction to head in.
Also, this is valid JSON. If I'm completely off track and it would be easier to parse this string as JSON first, that's okay too.
Thanks!
There's an awesome gem for this, json2csv, that I've used before.
To try it out, I'd save down a sample JSON response into a file called sample.json and then in your terminal you can run:
json2csv convert sample.json
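If parsing the string as JSON first is acceptable, the variable number of rows stops being a problem, because the rows arrive as an ordinary list. The idea is sketched here in Python rather than Ruby (Ruby's JSON.parse gives an equivalent nested structure). The function name is hypothetical, the response is trimmed to the keys that matter, and the output uses plain comma-separated lines instead of the exact header format from the question.

```python
import json

# Trimmed copy of the response from the question (only the keys used below).
bq_response = (
    '{"schema":{"fields":[{"name":"Revenue"},{"name":"Country"}]},'
    '"totalRows":"3",'
    '"rows":[{"f":[{"v":"100"},{"v":"Ireland"}]},'
    '{"f":[{"v":"200"},{"v":"Netherlands"}]},'
    '{"f":[{"v":"50"},{"v":"Singapore"}]}]}'
)


def columns_from_response(raw: str) -> dict:
    """Return {field name: [values...]} for however many rows are present."""
    data = json.loads(raw)
    names = [field["name"] for field in data["schema"]["fields"]]
    columns = {name: [] for name in names}
    for row in data["rows"]:
        # Each row's "f" list is in the same order as the schema fields.
        for name, cell in zip(names, row["f"]):
            columns[name].append(cell["v"])
    return columns


columns = columns_from_response(bq_response)
country_line = ",".join(["Country"] + columns["Country"])
revenue_line = ",".join(["Revenue"] + columns["Revenue"])
print(country_line)  # Country,Ireland,Netherlands,Singapore
print(revenue_line)  # Revenue,100,200,50
```

Because the loop runs over whatever is in "rows", there is no need to read "totalRows" to know how many countries to expect.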

Generating an avro schema with optional values

I am trying to write a very simple Avro schema (simple because it only illustrates my current issue) to write an Avro data file from data stored in JSON format. The trick is that one field is optional, and either avro-tools or I am not doing it right.
The goal is not to write my own serialiser. The end goal is to have this in Flume; I am in the early stages.
The data (works), in a file named so.log:
{
"valid": {"boolean":true}
, "source": {"bytes":"live"}
}
The schema, in a file named so.avsc:
{
"type":"record",
"name":"Event",
"fields":[
{"name":"valid", "type": ["null", "boolean"],"default":null}
, {"name":"source","type": ["null", "bytes"],"default":null}
]
}
I can easily generate an avro file with the following command:
java -jar avro-tools-1.7.6.jar fromjson --schema-file so.avsc so.log
So far so good. The thing is that "source" is optional, so I would expect the following data to be valid as well:
{
"valid": {"boolean":true}
}
But running the same command gives me the error:
Exception in thread "main" org.apache.avro.AvroTypeException: Expected start-union. Got END_OBJECT
at org.apache.avro.io.JsonDecoder.error(JsonDecoder.java:697)
at org.apache.avro.io.JsonDecoder.readIndex(JsonDecoder.java:441)
at org.apache.avro.io.ResolvingDecoder.doAction(ResolvingDecoder.java:229)
at org.apache.avro.io.parsing.Parser.advance(Parser.java:88)
at org.apache.avro.io.ResolvingDecoder.readIndex(ResolvingDecoder.java:206)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:155)
at org.apache.avro.generic.GenericDatumReader.readField(GenericDatumReader.java:193)
at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:183)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:151)
at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:142)
at org.apache.avro.tool.DataFileWriteTool.run(DataFileWriteTool.java:99)
at org.apache.avro.tool.Main.run(Main.java:84)
at org.apache.avro.tool.Main.main(Main.java:73)
I did try a lot of variations in the schema, even things that do not follow the avro spec. The schema I show here is, as far as I know, what the spec says it should be.
Would anybody know what I am doing wrong, and how I can actually have optional elements without writing my own serialiser?
Thanks,
According to the documentation of the Java API:
using a builder requires setting all fields, even if they are null
The python API, on the other hand, seems to allow null fields to be really optional:
Since the field favorite_color has type ["string", "null"], we are not required to specify this field
In short, as most tools are written in Java, null fields must usually be given explicitly.
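In Avro's JSON encoding a null union value is written as a plain JSON null, so the shortened record becomes acceptable once it reads {"valid": {"boolean": true}, "source": null}. One way to avoid writing those nulls by hand is to fill them in before passing the file to avro-tools. The sketch below assumes the schema from the question; fill_defaults is a hypothetical helper, not part of any Avro library. It copies each missing field's declared default as-is, which is only correct for null defaults like the ones here.

```python
import json

# The schema from the question (so.avsc).
schema = json.loads("""
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "valid",  "type": ["null", "boolean"], "default": null},
    {"name": "source", "type": ["null", "bytes"],   "default": null}
  ]
}
""")


def fill_defaults(record: dict, schema: dict) -> dict:
    """Add every field the record omits, using the schema's declared default.

    Assumption: all defaults are null. A non-null default for a union would
    need to be wrapped in Avro's {"type": value} JSON form first.
    """
    filled = dict(record)
    for field in schema["fields"]:
        if field["name"] not in filled and "default" in field:
            filled[field["name"]] = field["default"]
    return filled


record = {"valid": {"boolean": True}}
complete = fill_defaults(record, schema)
print(json.dumps(complete))  # {"valid": {"boolean": true}, "source": null}
```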
