I would like to compare the results of two runs of my pipeline, i.e. get the diff between JSON records that share the same schema but contain different data.
Run1 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}
{"doc_id": 1, "entity": "New York", "start": 30, "end": 38} # Missing from Run2
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}
Run2 JSON
{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7} # same as in Run1
{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10} # different end span
{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15} # added in Run2, not in Run1
Based on the answer here, my approach has been to make a tuple out of some of the JSON values and then cogroup on this large composite key: How do I perform a "diff" on two Sources given a key using Apache Beam Python SDK?
Is there a better way to diff JSON records with Beam?
Code based on linked answer:
import json

import apache_beam as beam
from apache_beam import pvalue

def make_kv_pair(x):
    """Parse the record if needed and output it keyed by (doc_id, entity)."""
    if x and isinstance(x, str):  # `basestring` on Python 2
        x = json.loads(x)
    key = tuple(x[dict_key] for dict_key in ["doc_id", "entity"])
    return (key, x)
class FilterDoFn(beam.DoFn):
    def process(self, element):  # tuple unpacking in the signature is Python 2 only
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('removed', key)
        else:  # same length but different content
            yield pvalue.TaggedOutput('changed', key)
Pipeline code:
table_a = (p | 'ReadJSONRun1' >> ReadFromText("run1.json")
| 'SetKeysRun1' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadJSONRun2' >> ReadFromText("run2.json")
| 'SetKeysRun2' >> beam.Map(make_kv_pair))
joined_tables = ({'table_a': table_a, 'table_b': table_b}
| beam.CoGroupByKey())
output_types = ['changed', 'added', 'removed', 'unchanged']
key_collections = (joined_tables
| beam.ParDo(FilterDoFn()).with_outputs(*output_types))
# Now you can handle each output
key_collections.unchanged | "WriteUnchanged" >> WriteToText("unchanged/", file_name_suffix="_unchanged.json.gz")
key_collections.changed | "WriteChanged" >> WriteToText("changed/", file_name_suffix="_changed.json.gz")
key_collections.added | "WriteAdded" >> WriteToText("added/", file_name_suffix="_added.json.gz")
key_collections.removed | "WriteRemoved" >> WriteToText("removed/", file_name_suffix="_removed.json.gz")
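The same classification logic can be sanity-checked outside Beam with plain dictionaries. Below is a minimal sketch (an assumption of this edit, not part of the original pipeline) using the sample rows above, with one record per (doc_id, entity) key:

```python
import json

run1 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 1, "entity": "New York", "start": 30, "end": 38}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 8}',
]
run2 = [
    '{"doc_id": 1, "entity": "Anthony", "start": 0, "end": 7}',
    '{"doc_id": 2, "entity": "Istanbul", "start": 0, "end": 10}',
    '{"doc_id": 2, "entity": "Karim", "start": 10, "end": 15}',
]

def keyed(lines):
    """Index the records by the same (doc_id, entity) composite key."""
    out = {}
    for line in lines:
        record = json.loads(line)
        out[(record["doc_id"], record["entity"])] = record
    return out

def diff(a_lines, b_lines):
    """Classify each key as unchanged/changed/added/removed, like FilterDoFn."""
    table_a, table_b = keyed(a_lines), keyed(b_lines)
    result = {"unchanged": [], "changed": [], "added": [], "removed": []}
    for key in table_a.keys() | table_b.keys():
        if key not in table_b:
            result["removed"].append(key)
        elif key not in table_a:
            result["added"].append(key)
        elif table_a[key] == table_b[key]:
            result["unchanged"].append(key)
        else:
            result["changed"].append(key)
    return result
```

With the sample data, (1, "Anthony") comes out unchanged, (1, "New York") removed, (2, "Istanbul") changed, and (2, "Karim") added.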
I am currently working with Highcharts in combination with the pattern-fill module. When I set a pattern for a series in the chart, the pattern is shown, but it has a transparent background. I need to set an additional background because the pattern overlaps another series which I don't want to see behind it. You can check this fiddle: basically, I don't want to see those three columns on the left behind the pattern. Any ideas how I can do that? I haven't seen any option to set an additional background, but maybe you know a trick. This is the code I am using for the pattern:
"color": {
"pattern": {
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11"
},
"width": 10,
"height": 10,
"opacity": 1,
"color": "rgb(84,198,232)"
}
}
You need to set the fill attribute as a path property:
"color": {
"pattern": {
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11",
fill: 'red'
},
"width": 10,
"height": 10,
"opacity": 1,
"color": 'rgb(84,198,232)'
}
}
Live demo: https://jsfiddle.net/BlackLabel/m9rxwej5/
I guess there has since been an update: backgroundColor should now be set at the pattern's root level:
"color": {
"pattern": {
"backgroundColor": 'red',
"path": {
"d": "M 0 0 L 10 10 M 9 -1 L 11 1 M -1 9 L 1 11",
},
"width": 10,
"height": 10,
"opacity": 1,
"color": 'rgb(84,198,232)',
}
}
https://jsfiddle.net/vL4fqhao/
I am trying to sum an array of arrays and get the averages at the same time. The original data is JSON. I have to parse my data into an array of arrays in order to render the graph, because the graph does not accept an array of hashes.
I first convert the output to JSON using the definition below.
ActiveSupport::JSON.decode(@output.first(10).to_json)
And the result of the above action is shown below.
output =
[{"name"=>"aaa", "job"=>"a", "pay"=> 2, ... },
{"name"=>"zzz", "job"=>"a", "pay"=> 4, ... },
{"name"=>"xxx", "job"=>"a", "pay"=> 6, ... },
{"name"=>"yyy", "job"=>"a", "pay"=> 8, ... },
{"name"=>"aaa", "job"=>"b", "pay"=> 2, ... },
{"name"=>"zzz", "job"=>"b", "pay"=> 4, ... },
{"name"=>"xxx", "job"=>"b", "pay"=> 6, ... },
{"name"=>"yyy", "job"=>"b", "pay"=> 10, ... },
]
Then I retrieved the job and pay by converting to an array of arrays.
a = []
ActiveSupport::JSON.decode(output.to_json).each { |h|
  a << [h['job'], h['pay']]
}
The result of the above operation is as below.
a = [["a", 2], ["a", 4], ["a", 6], ["a", 8],
["b", 2], ["b", 4], ["b", 6], ["b", 10]]
The code below will give me the sum of each element in the form of array of array.
a.inject({}) { |h,(job, data)| h[job] ||= 0; h[job] += data; h }.to_a
And the result is as below
[["a", 20], ["b", 22]]
However, I am trying to get the average of the array. The expected output is as below.
[["a", 5], ["b", 5.5]]
I can count how many elements are in each group and divide the sum array by the count array, but I was wondering if there is an easier and more efficient way to get the average.
output = [
{"name"=>"aaa", "job"=>"a", "pay"=> 2 },
{"name"=>"zzz", "job"=>"a", "pay"=> 4 },
{"name"=>"xxx", "job"=>"a", "pay"=> 6 },
{"name"=>"yyy", "job"=>"a", "pay"=> 8 },
{"name"=>"aaa", "job"=>"b", "pay"=> 2 },
{"name"=>"zzz", "job"=>"b", "pay"=> 4 },
{"name"=>"xxx", "job"=>"b", "pay"=> 6 },
{"name"=>"yyy", "job"=>"b", "pay"=> 10 },
]
output.group_by { |obj| obj['job'] }.map do |key, list|
[key, list.map { |obj| obj['pay'] }.reduce(:+) / list.size.to_f]
end
The group_by method will transform your list into a hash with the following structure:
{"a"=>[{"name"=>"aaa", "job"=>"a", "pay"=>2}, ...], "b"=>[{"name"=>"aaa", "job"=>"b", ...]}
After that, for each pair of that hash, we want to calculate the mean of its 'pay' values, and return a pair [key, mean]. We use a map for that, returning a pair with:
The key itself ("a" or "b").
The mean of the values. Note that the values have the form of a list of hashes. To retrieve the pay values, we need to extract 'pay' from each hash; that's what list.map { |obj| obj['pay'] } is for. Finally, calculate the mean by summing all elements with .reduce(:+) and dividing by the list size as a float.
Not the most efficient solution, but it's practical.
Comparing the answer with @EricDuminil's, here's a benchmark with a list of size 8,000,000:
def Wikiti(output)
output.group_by { |obj| obj['job'] }.map do |key, list|
[key, list.map { |obj| obj['pay'] }.reduce(:+) / list.size.to_f]
end
end
def EricDuminil(output)
count_and_sum = output.each_with_object(Hash.new([0, 0])) do |hash, mem|
job = hash['job']
count, sum = mem[job]
mem[job] = count + 1, sum + hash['pay']
end
result = count_and_sum.map do |job, (count, sum)|
[job, sum / count.to_f]
end
end
require 'benchmark'
Benchmark.bm do |x|
x.report('Wikiti') { Wikiti(output) }
x.report('EricDuminil') { EricDuminil(output) }
end
user system total real
Wikiti 4.100000 0.020000 4.120000 ( 4.130373)
EricDuminil 4.250000 0.000000 4.250000 ( 4.272685)
This method should be reasonably efficient. It creates a temporary hash with job name as key and [count, sum] as value:
output = [{ 'name' => 'aaa', 'job' => 'a', 'pay' => 2 },
{ 'name' => 'zzz', 'job' => 'a', 'pay' => 4 },
{ 'name' => 'xxx', 'job' => 'a', 'pay' => 6 },
{ 'name' => 'yyy', 'job' => 'a', 'pay' => 8 },
{ 'name' => 'aaa', 'job' => 'b', 'pay' => 2 },
{ 'name' => 'zzz', 'job' => 'b', 'pay' => 4 },
{ 'name' => 'xxx', 'job' => 'b', 'pay' => 6 },
{ 'name' => 'yyy', 'job' => 'b', 'pay' => 10 }]
count_and_sum = output.each_with_object(Hash.new([0, 0])) do |hash, mem|
job = hash['job']
count, sum = mem[job]
mem[job] = count + 1, sum + hash['pay']
end
#=> {"a"=>[4, 20], "b"=>[4, 22]}
result = count_and_sum.map do |job, (count, sum)|
[job, sum / count.to_f]
end
#=> [["a", 5.0], ["b", 5.5]]
It requires 2 passes, but the created objects aren't big. In comparison, calling group_by on a huge array of hashes isn't very efficient.
How about this (single-pass iterative average calculation)?
accumulator = Hash.new {|h,k| h[k] = Hash.new(0)}
a.each_with_object(accumulator) do |(k,v),obj|
obj[k][:count] += 1
obj[k][:sum] += v
obj[k][:average] = (obj[k][:sum] / obj[k][:count].to_f)
end
#=> {"a"=>{:count=>4, :sum=>20, :average=>5.0},
# "b"=>{:count=>4, :sum=>22, :average=>5.5}}
Obviously, the average is recalculated on every iteration, but since you asked for both at the same time, this is probably as close as you are going to get.
Using your "output" instead looks like
output.each_with_object(accumulator) do |h,obj|
key = h['job']
obj[key][:count] += 1
obj[key][:sum] += h['pay']
obj[key][:average] = (obj[key][:sum] / obj[key][:count].to_f)
end
#=> {"a"=>{:count=>4, :sum=>20, :average=>5.0},
# "b"=>{:count=>4, :sum=>22, :average=>5.5}}
As Sara Tibbetts' comment suggests, my first step would be to convert it like this:
new_a = a.reduce({}){ |memo, item| memo[item[0]] ||= []; memo[item[0]] << item[1]; memo}
which puts it in this format
{"a"=>[2, 4, 6, 8], "b"=>[2, 4, 6, 10]}
You can then use slice! (from ActiveSupport) to keep only the keys you want:
new_a.slice!(key1, key2, ...)
Then do another pass to get the final format:
new_a.reduce([]) do |memo, (k,v)|
avg = v.inject{ |sum, el| sum + el }.to_f / v.size
memo << [k,avg]
memo
end
I elected to use Enumerable#each_with_object with the object being an array of two hashes: the first to compute totals, the second to count the numbers that are totalled. Each hash is defined with Hash.new(0), zero being the default value. See Hash::new for a fuller explanation. In short, if a hash defined h = Hash.new(0) does not have a key k, then h[k] returns 0 (h is not modified). h[k] += 1 expands to h[k] = h[k] + 1, and if h does not have a key k, h[k] on the right of the equals sign returns 0.¹
output =
[{"name"=>"aaa", "job"=>"a", "pay"=> 2},
{"name"=>"zzz", "job"=>"a", "pay"=> 4},
{"name"=>"xxx", "job"=>"a", "pay"=> 6},
{"name"=>"yyy", "job"=>"a", "pay"=> 8},
{"name"=>"aaa", "job"=>"b", "pay"=> 2},
{"name"=>"zzz", "job"=>"b", "pay"=> 4},
{"name"=>"xxx", "job"=>"b", "pay"=> 6},
{"name"=>"yyy", "job"=>"b", "pay"=>10}
]
htot, hnbr = output.each_with_object([Hash.new(0), Hash.new(0)]) do |f,(g,h)|
s = f["job"]
g[s] += f["pay"]
h[s] += 1
end
htot.merge(hnbr) { |k,o,n| o.to_f/n }.to_a
#=> [["a", 5.0], ["b", 5.5]]
If .to_a at the end is dropped, the hash {"a"=>5.0, "b"=>5.5} is returned. The OP might find that more useful than the array.
I've used the form of Hash#merge that uses a block to determine the values of keys that are present in both hashes being merged.
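As a small standalone illustration of that block form of merge, using the htot and hnbr values from this answer:

```ruby
htot = { "a" => 20, "b" => 22 }
hnbr = { "a" => 4, "b" => 4 }

# For keys present in both hashes, the block receives (key, old_value, new_value)
# and its result becomes the merged value.
avg = htot.merge(hnbr) { |_key, total, count| total.to_f / count }
# => {"a"=>5.0, "b"=>5.5}
```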
Note that htot = {"a"=>20, "b"=>22} and hnbr = {"a"=>4, "b"=>4}.
¹ If the reader is wondering why h[k] on the left of = doesn't return zero as well, it's because a different method is involved: Hash#[]= versus Hash#[].
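The default-value behaviour described in the footnote can be seen in isolation:

```ruby
h = Hash.new(0)   # 0 is returned for missing keys
h[:missing]       # => 0, and h is still empty (Hash#[] does not store anything)
h.key?(:missing)  # => false
h[:count] += 1    # expands to h[:count] = h[:count] + 1, so Hash#[]= stores 1
h                 # => {:count=>1}
```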
I have an array containing values and arrays like below:
arr = [
[ 0, [ [22,3],[23,5] ] ],
[ 0, [ [22,1],[23,2] ] ],
[ 1, [ [22,4],[23,4] ] ],
[ 1, [ [22,2],[23,4] ] ]
]
I want to calculate the average based on first two elements and want to have a result set either in hash or array as below:
result = {
22 => [(3+1)/2, (4+2)/2],
23 => [(5+2)/2, (4+4)/2]
}
where for example:
the key is 22 and the value is an array containing the averages of the third elements in the input array (3 and 1, then 4 and 2), grouped and ordered by the first element (0 and 1)
How the array is created
Maybe it is helpful to explain my logic.
The array is obtained by the following code out of my ActiveRecord objects:
arr = u.feedbacks.map{|f| [f.week,
f.answers.map{|a| [a.question.id, a.name.to_i]}]}
where models are associated as below:
feedback belongs_to :user
feedback has_and_belongs_to_many :answers
answer belongs_to :question
For each question I wanted to create an array containing the averages of the answers, grouped by the feedback week.
With a bit of debugging, the following should help get much faster results:
Answer.
joins(:question, :feedbacks). # assuming that answer has_many feedbacks
group(["questions.id", "feedbacks.week"]). # assuming week is integer column
average("CAST(answers.name AS INT)"). # assuming that name is string-y column
each_with_object({}) do |(keys, average), hash|
question_id, week = keys
hash[question_id] ||= []
hash[question_id][week] = average
end
If you want to keep things the way they are (not advised), then one working (albeit hard-to-follow) solution is this:
arr = [
[0, [[22, 3], [23, 5]]],
[0, [[22, 1], [23, 2]]],
[1, [[22, 4], [23, 4]]],
[1, [[22, 2], [23, 4]]]
]
arr.each_with_object({}) do |(a, b), hash|
c, d, e, f = b.flatten
# for first row this will be c, d, e, f = 22, 3, 23, 5
hash[c] ||= []
hash[c][a] ||= []
hash[c][a] << d
hash[e] ||= []
hash[e][a] ||= []
hash[e][a] << f
end.each_with_object({}) do |(k, v), hash|
# k are your 'keys' like 22, 23
# v is an array of arrays that you want to find out the averages of
hash[k] = \
v.map do |array|
array.reduce(:+).fdiv(array.size)
end
end
If it were up to me, I would refactor the way arr is created in the first place, since:
The dimensionality of the array is counter-intuitive.
The transformation takes its toll on readability and, in turn, maintainability.
But I have no more insight than what I can see from the code you have shown. So I played around with it a little bit; perhaps the code below is what you wanted?
totals = {}
arr.each do |row|
index, answers = row
answers.each do |answer|
question, count = answer
totals[question] ||= []
totals[question][index] ||= []
totals[question][index] << count
end
end
Below is the output of totals; from there it's trivial to get your average.
{
  22 => [[3, 1], [4, 2]],
  23 => [[5, 2], [4, 4]]
}
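The final averaging step that the answer calls trivial might look like this (transform_values requires Ruby 2.4+):

```ruby
totals = {
  22 => [[3, 1], [4, 2]],
  23 => [[5, 2], [4, 4]]
}

# Average each group of counts, keeping the question ids as keys.
result = totals.transform_values do |groups|
  groups.map { |counts| counts.sum.fdiv(counts.size) }
end
# => {22=>[2.0, 3.0], 23=>[3.5, 4.0]}
```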
EDIT: Below is the solution I worked out using each_with_object, which I learned from @Humza.
arr = [
[ 0, [ [22,3],[23,5] ] ],
[ 0, [ [22,1],[23,2] ] ],
[ 1, [ [22,4],[23,4] ] ],
[ 1, [ [22,2],[23,4] ] ]
]
result = arr.each_with_object({}) do |(index, feedbacks), totals|
feedbacks.each do |(question, count)|
totals[question] ||= {}
totals[question][index] ||= []
totals[question][index] << count
end
end.each_with_object({}) do |(question, totals), result|
result[question] = totals.map do |(index, total)|
total.reduce(:+).fdiv(total.length)
end
end
puts result.inspect
## Output
# {22=>[2.0, 3.0], 23=>[3.5, 4.0]}
I would like to do a partial sort of an array before sorting the complete array. The following will return an array sorted on "sortOrder":
[folders sortedArrayUsingDescriptors:@[[NSSortDescriptor sortDescriptorWithKey:@"sortOrder" ascending:YES]]]
The array will look like this: [x, a, z, w, y]
but I would first like entries whose "sortOrder" is zero to be sorted on "name" in ascending order.
So the final array would look like this: [a, x, z, w, y]
Does anyone have an idea on how to do this?
"folders": [
{
"sortOrder": 0,
"name": "x",
},
{
"sortOrder": 0,
"name": "a",
},
{
"sortOrder": 1,
"name": "z",
},
{
"sortOrder": 3,
"name": "y",
},
{
"sortOrder": 2,
"name": "w",
}
]
Pass in two sort descriptors, one for "sortOrder" and the second for "name":
NSArray *descriptors = @[
    [NSSortDescriptor sortDescriptorWithKey:@"sortOrder" ascending:YES],
    [NSSortDescriptor sortDescriptorWithKey:@"name" ascending:YES],
];
NSArray *sortedArray = [folders sortedArrayUsingDescriptors:descriptors];
With this setup, any two items with the same sort order will then be sorted by name.