String Indexer, CountVectorizer Pyspark on single row - machine-learning

Hi, I'm facing a problem where each row has two columns, each holding an array of words:
column1, column2
["a", "b" ,"b", "c"], ["a","b", "x", "y"]
Basically, I want to count the occurrences of each word in each column, so that I end up with two arrays:
[1, 2, 1, 0, 0],
[1, 1, 0, 1, 1]
So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" only appears in column1, "x" and "y" only in column2. So on and so forth.
I've looked at CountVectorizer from the ml library, but I'm not sure whether it works row-wise, and the arrays in each column can be very large. Also, 0 values (where a word appears in one column but not the other) don't seem to get carried through.
Any help appreciated.

For Spark 2.4+, you can do that using the DataFrame API and built-in array functions.
First, get all the distinct words for each row using the array_union function. Then use the transform function on that words array: for each element, calculate its number of occurrences in each column as the size of the column minus the size of the column after array_remove has removed that element:
from pyspark.sql.functions import array_union, expr

# Assumes an existing SparkSession named `spark`.
df = spark.createDataFrame([(["a", "b", "b", "c"], ["a", "b", "x", "y"])], ["column1", "column2"])

df.withColumn("words", array_union("column1", "column2")) \
    .withColumn("occ_column1",
                expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
    .withColumn("occ_column2",
                expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
    .drop("words") \
    .show(truncate=False)
Output:
+------------+------------+---------------+---------------+
|column1     |column2     |occ_column1    |occ_column2    |
+------------+------------+---------------+---------------+
|[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
+------------+------------+---------------+---------------+

Related

Ruby Intro to Parallel Assignments

a = [1, 2, 3, 4]
b, c = 99, *a → b == 99, c == 1
b, *c = 99, *a → b == 99, c == [1, 2, 3, 4]
Can someone please thoroughly explain why the asterisk makes this code return what it returns in Ruby? I understand that if an lvalue has an asterisk, the remaining rvalues are assigned to that lvalue. However, why does '*a' make 'c' receive only the value '1' from the array, and why do '*a' and '*c' cancel each other out?
In both cases, 99, *a on the right-hand side expands into the array [99, 1, 2, 3, 4].
In
b, c = 99, *a
b and c become the first two values of the array, with the rest of the array discarded.
In
b, *c = 99, *a
b becomes the first value from the array and c is assigned the rest (because of the splat on the left-hand side).
The 99, *a on the right-hand side is an example of where the square brackets around an array are optional in an assignment.
A simpler example:
a = 1, 2, 3 → a == [1, 2, 3]
Or a more explicit version of your example:
example = [99, *a] → example == [99, 1, 2, 3, 4]
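The cases above can be put side by side in a short sketch. The variable names first_case, second_case, third_case, head, middle and tail are only illustrative and are not from the original question; the third case (a splat in the middle of the left-hand side) is an extra illustration of the same rule.

```ruby
a = [1, 2, 3, 4]

# Splat on the right only: the right-hand side becomes [99, 1, 2, 3, 4];
# b and c take the first two values and the rest are dropped.
b, c = 99, *a
first_case = [b, c]

# Splat on both sides: c collects everything after the first value.
b, *c = 99, *a
second_case = [b, c]

# A splat in the middle of the left-hand side collects whatever is left
# after the non-splat targets have been filled from both ends.
head, *middle, tail = 99, *a
third_case = [head, middle, tail]

p first_case   # [99, 1]
p second_case  # [99, [1, 2, 3, 4]]
p third_case   # [99, [1, 2, 3], 4]
```

So the two splats do not really cancel out: the right-hand one unpacks a into a flat list of values, and the left-hand one gathers the leftover values into a new array.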

How to get a list of integers from a given set of possible integers in z3?

Minimal example is the following: Given a set of possible integers [1, 2, 3] create an arbitrary list of size 5 using z3py. Duplicates are allowed.
The expected result is something like [1, 1, 1, 1, 1] or [3, 1, 2, 2, 3], etc.
How should I tackle this problem, and how do I implement the 'choosing'? Ultimately, I would like to find all solutions, which can be done by adding additional constraints as explained in link. Any help would be much appreciated.
The following should work:
from z3 import *

def choose(elts, acceptable):
    s = Solver()
    # Each variable must equal one of the acceptable values.
    s.add(And([Or([x == v for v in acceptable]) for x in Ints(elts)]))
    models = []
    while s.check() == sat:
        m = s.model()
        if not m:
            break
        models.append(m)
        # Block the current model so that the next check() yields a new one.
        block = Not(And([v() == m[v] for v in m]))
        s.add(block)
    return models

print(choose('a b c d e', [1, 2, 3]))

How to remove array elements and append it to the front of the array in ruby without using any inbuilt methods?

I have an array, say [1,2,3,4,5,6,7,8]. I need to take a number from the user, remove that many elements from the end of the array, and prepend them to the front of the array. This is what I have so far:
def test(number, array)
  b = array - array[0...(array.length-1) - number]
  array = array.unshift(b).flatten.uniq
  return array
end

number = gets.chomp.to_i
array = [1,2,3,4,5,7,8,9]
Passing these arguments to test gives me the result. However, there are two problems here. First, I want to find a way to do this append on the front without any inbuilt method (i.e. not using unshift). Second, I am using uniq here, which is wrong, since values in the original array may repeat. So how do I still ensure that I get the correct output? Can someone give me a better solution to this?
The standard way is:
[1, 2, 3, 4, 5, 7, 8, 9].rotate(-3) #=> [7, 8, 9, 1, 2, 3, 4, 5]
Based on the link I supplied in the comments, I threw this together using the answer to that question.
def test(number, array)
  reverse_array(array, 0, array.length - 1)
  reverse_array(array, 0, number - 1)
  reverse_array(array, number, array.length - 1)
  array
end

def reverse_array(array, low, high)
  while low < high
    array[low], array[high] = array[high], array[low]
    low += 1
    high -= 1
  end
end
and then the tests
array = [1,2,3,4,5,7,8,9]
test(2, array)
#=> [8, 9, 1, 2, 3, 4, 5, 7]
array = [3, 4, 5, 2, 3, 1, 4]
test(2, array)
#=> [1, 4, 3, 4, 5, 2, 3]
I believe this is what you want, and I feel it sufficiently avoids Ruby built-ins. No matter how you look at it, you are going to need to get the value at an index and set a value at an index to do this in place.
"I want to find a way to do this append on the front without any inbuilt method"
You can decompose an array during assignment:
array = [1, 2, 3, 4, 5, 6, 7, 8]
*remaining, last = array
remaining #=> [1, 2, 3, 4, 5, 6, 7]
last #=> 8
The splat operator (*) gathers any remaining elements. The last element will be assigned to last, the remaining elements (all but the last element) are assigned to remaining (as a new array).
Likewise, you can implicitly create an array during assignment:
array = last, *remaining
#=> [8, 1, 2, 3, 4, 5, 6, 7]
Here, the splat operator unpacks the array, so you don't get [8, [1, 2, 3, 4, 5, 6, 7]]
The above moves the last element to the front. To rotate an array n times this way, use a loop:
array = [1, 2, 3, 4, 5, 6, 7, 8]
n = 3
n.times do
  *remaining, last = array
  array = last, *remaining
end
array
#=> [6, 7, 8, 1, 2, 3, 4, 5]
Aside from times, no methods were called explicitly.
You could create a new Array with the elements at the correct position thanks to modulo:
array = %w[a b c d e f g h i]
shift = 3
n = array.size
p Array.new(n) { |i| array[(i - shift) % n] }
# ["g", "h", "i", "a", "b", "c", "d", "e", "f"]
Array.new() is a builtin method though ;)
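For completeness, the same modulo idea can be sketched with a plain while loop, so that the only things called on the array are length and index access. This is a minimal sketch rather than a definitive solution; rotate_right is a hypothetical helper name, not something from the question. It never calls uniq, so repeated values are preserved.

```ruby
# Returns a new array with the last `count` elements moved to the front.
def rotate_right(array, count)
  size = array.length
  return [] if size == 0

  count %= size            # rotating by a multiple of size changes nothing
  result = []
  i = 0
  while i < size
    # The element that ends up at position i started `count` slots earlier;
    # Ruby's % is non-negative for a positive modulus, so this wraps around.
    result[i] = array[(i - count) % size]
    i += 1
  end
  result
end

p rotate_right([1, 2, 3, 4, 5, 7, 8, 9], 3)  # [7, 8, 9, 1, 2, 3, 4, 5]
p rotate_right([3, 4, 5, 2, 3, 1, 4], 2)     # [1, 4, 3, 4, 5, 2, 3]
```

The second call uses an array with repeated values, which a uniq-based approach would get wrong.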

Elixir: Split list into odd and even elements as two items in tuple

I am quite new to Elixir programming and badly stuck at splitting a list into a two-element tuple.
Given a list of integers, return a two element tuple. The first element is a list of the even numbers from the list. The second is a list of the odd numbers.
Input : [ 1, 2, 3, 4, 5 ]
Output { [ 2, 4], [ 1, 3, 5 ] }
I have got as far as identifying whether a number is odd or even, but I'm not sure how to proceed.
defmodule OddOrEven do
  import Integer

  def task(list) do
    Enum.reduce(list, [], fn(x, acc) ->
      case Integer.is_odd(x) do
        :true -> # how do I get this odd value listed as a tuple element
        :false -> # how do I get this even value listed as a tuple element
      end
      #IO.puts(x)
    end)
  end
end
You can use Enum.partition/2 (deprecated since Elixir 1.4 in favor of Enum.split_with/2, which takes the same arguments):
iex(1)> require Integer
iex(2)> [1, 2, 3, 4, 5] |> Enum.partition(&Integer.is_even/1)
{[2, 4], [1, 3, 5]}
If you really want to use Enum.reduce/2, you can do this:
iex(3)> {evens, odds} = [1, 2, 3, 4, 5] |> Enum.reduce({[], []}, fn n, {evens, odds} ->
...(3)> if Integer.is_even(n), do: {[n | evens], odds}, else: {evens, [n | odds]}
...(3)> end)
{[4, 2], [5, 3, 1]}
iex(4)> {Enum.reverse(evens), Enum.reverse(odds)}
{[2, 4], [1, 3, 5]}
Or you can use the Erlang :lists module:
iex> :lists.partition(fn (n) -> rem(n, 2) == 1 end, [1,2,3,4,5])
{[1,3,5],[2,4]}

Rails 3. How to get the difference between two arrays?

Let’s say I have this array of shipment ids.
s = Shipment.find(:all, :select => "id")
[#<Shipment id: 1>, #<Shipment id: 2>, #<Shipment id: 3>, #<Shipment id: 4>, #<Shipment id: 5>]
And an array of invoices with shipment ids:
i = Invoice.find(:all, :select => "id, shipment_id")
[#<Invoice id: 98, shipment_id: 2>, #<Invoice id: 99, shipment_id: 3>]
Invoice belongs to Shipment.
Shipment has one Invoice.
So the invoices table has a shipment_id column.
To create an invoice, I click on New Invoice, and there is a select menu with Shipments so I can choose which shipment I am creating the invoice for. So I only want to display a list of shipments that an invoice hasn't been created for.
So I need an array of Shipments that don't have an Invoice yet. In the example above, the answer would be 1, 4, 5.
a = [2, 4, 6, 8]
b = [1, 2, 3, 4]
a - b | b - a # => [6, 8, 1, 3]
First you would get a list of the shipment_ids that appear in invoices:
ids = i.map{|x| x.shipment_id}
Then 'reject' them from your original array:
s.reject{|x| ids.include? x.id}
Note: remember that reject returns a new array; use reject! if you want to change the original array.
Use the subtraction (minus) operator:
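The map-then-reject approach can be tried without a database. In the sketch below, Shipment and Invoice are plain Struct stand-ins for the ActiveRecord models (an assumption made purely for illustration), and invoiced_ids and uninvoiced are hypothetical names.

```ruby
# Plain Struct stand-ins for the ActiveRecord models (illustration only).
Shipment = Struct.new(:id)
Invoice  = Struct.new(:id, :shipment_id)

shipments = (1..5).map { |id| Shipment.new(id) }
invoices  = [Invoice.new(98, 2), Invoice.new(99, 3)]

# Collect the shipment ids that already have an invoice.
invoiced_ids = invoices.map { |invoice| invoice.shipment_id }

# reject returns a new array and leaves `shipments` untouched.
uninvoiced = shipments.reject { |shipment| invoiced_ids.include?(shipment.id) }

p uninvoiced.map(&:id)  # [1, 4, 5]
p shipments.length      # 5
```

With reject! in place of reject, shipments itself would shrink to the three uninvoiced records.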
irb(main):001:0> [1, 2, 3, 2, 6, 7] - [2, 1]
=> [3, 6, 7]
Ruby 2.6 introduces Array#difference:
[1, 1, 2, 2, 3, 3, 4, 5 ].difference([1, 2, 4]) #=> [ 3, 3, 5 ]
So in the case given here:
Shipment.pluck(:id).difference(Invoice.pluck(:shipment_id))
Seems a nice elegant solution to the problem. I've been a keen follower of a - b | b - a, though it can be tricky to recall at times.
This certainly takes care of that.
A pure Ruby solution is:
(a + b) - (a & b)
([1,2,3,4] + [1,3]) - ([1,2,3,4] & [1,3])
=> [2,4]
Here a + b concatenates the two arrays (which acts as a union when there are no duplicates),
a & b returns the intersection,
and union - intersection returns the elements that appear in only one of the two arrays.
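One detail worth checking is how duplicates are treated. A small sketch (the variable names via_concat and via_union are illustrative only) compares this expression with the a - b | b - a form seen earlier in the thread:

```ruby
a = [1, 1, 2, 3]
b = [3, 4, 4]

# + concatenates and keeps duplicates; & is the intersection.
via_concat = (a + b) - (a & b)

# - binds tighter than |, so this is (a - b) | (b - a); | drops duplicates.
via_union = a - b | b - a

p via_concat  # [1, 1, 2, 4, 4]
p via_union   # [1, 2, 4]
```

Both contain the same distinct values, so the choice only matters when repeated elements need to be kept or collapsed.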
The previous answer here from pgquardiario only included a one-directional difference. If you want the difference in both directions (that is, the items unique to either array), then try something like the following.
def diff(x, y)
  # Elements that appear in only one of the two arrays.
  only_in_x = x.reject { |a| y.include?(a) }
  only_in_y = y.reject { |a| x.include?(a) }
  only_in_x | only_in_y
end
This should do it in one ActiveRecord query:
Shipment.where(["id NOT IN (?)", Invoice.select(:shipment_id)]).select(:id)
And it outputs the SQL
SELECT "shipments"."id" FROM "shipments" WHERE (id NOT IN (SELECT "invoices"."shipment_id" FROM "invoices"))
In Rails 4+ you can do the following
Shipment.where.not(id: Invoice.select(:shipment_id).distinct).select(:id)
And it outputs the SQL
SELECT "shipments"."id" FROM "shipments" WHERE ("shipments"."id" NOT IN (SELECT DISTINCT "invoices"."shipment_id" FROM "invoices"))
And instead of select(:id) I recommend the ids method.
Shipment.where.not(id: Invoice.select(:shipment_id).distinct).ids
When dealing with arrays of Strings, it can be useful to keep the differences grouped together.
In which case, we can use Array#zip to group the elements together and then use a block to decide what to do with the grouped elements (Array).
a = ["One", "Two", "Three", "Four"]
b = ["One", "Not Two", "Three", "For" ]
mismatches = []
a.zip(b) do |array|
  mismatches << array if array.first != array.last
end
mismatches
# => [
# ["Two", "Not Two"],
# ["Four", "For"]
# ]
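If the position of each mismatch also matters, a variation on the same zip idea is to pair it with each_with_index. This is only a sketch; the names left, right, index and padded are illustrative.

```ruby
a = ["One", "Two", "Three", "Four"]
b = ["One", "Not Two", "Three", "For"]

mismatches = []
a.zip(b).each_with_index do |(left, right), index|
  # Record where the two arrays disagree, along with both values.
  mismatches << [index, left, right] if left != right
end

p mismatches  # [[1, "Two", "Not Two"], [3, "Four", "For"]]

# zip pads with nil when the argument array is shorter than the receiver.
padded = ["One", "Two"].zip(["One"])
p padded      # [["One", "One"], ["Two", nil]]
```

Because of the nil padding, a missing element in the second array shows up as a mismatch against nil.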
s.select{|x| !ids.include? x.id}
