CountVectorizer skips letters but returns count of words

I have a list of words like below.
words = ['john', 'i', 'romeo', 'i', 'john', 'steve', 'k']
I apply CountVectorizer to get the count of words as below.
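from sklearn.feature_extraction.text import CountVectorizer
vec = CountVectorizer().fit(words)
word_library = vec.transform(words)
sum_words = word_library.sum(axis=0)
word_counts = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]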
It returns
[('john', 2), ('romeo', 1), ('steve', 1)]
I would also like the counts of single letters; they should not vanish in the process. The expected result is:
[('john', 2), ('i', 2), ('romeo', 1), ('steve', 1), ('k', 1)]
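A likely cause is CountVectorizer's default token_pattern, r"(?u)\b\w\w+\b", which only keeps tokens of two or more word characters. Passing token_pattern=r"(?u)\b\w+\b" to CountVectorizer should keep single letters. The sketch below is not scikit-learn code: it only applies the two regular expressions with the standard library to show the difference, and count_tokens is a hypothetical helper introduced for illustration.

```python
import re
from collections import Counter

words = ['john', 'i', 'romeo', 'i', 'john', 'steve', 'k']

# Default CountVectorizer token_pattern: a token needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters, so 'i' and 'k' survive.
RELAXED_PATTERN = r"(?u)\b\w+\b"

def count_tokens(docs, pattern):
    """Hypothetical helper: count regex tokens over all documents, lowercased."""
    counts = Counter()
    for doc in docs:
        counts.update(re.findall(pattern, doc.lower()))
    return counts

default_counts = count_tokens(words, DEFAULT_PATTERN)
relaxed_counts = count_tokens(words, RELAXED_PATTERN)
print(sorted(default_counts.items()))  # [('john', 2), ('romeo', 1), ('steve', 1)]
print(sorted(relaxed_counts.items()))  # [('i', 2), ('john', 2), ('k', 1), ('romeo', 1), ('steve', 1)]
```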

Related

String Indexer, CountVectorizer Pyspark on single row

Hi, I'm faced with a problem where each row has two columns, each holding an array of words.
column1, column2
["a", "b" ,"b", "c"], ["a","b", "x", "y"]
Basically I want to count the occurrence of each word between columns to end up with two arrays:
[1, 2, 1, 0, 0],
[1, 1, 0, 1, 1]
So "a" appears once in each array, "b" appears twice in column1 and once in column2, "c" only appears in column1, "x" and "y" only in column2. So on and so forth.
I've looked at the CountVectorizer function from the ml library, but I'm not sure whether it works row-wise, and the arrays in each column can be very large. Also, 0 values (where a word appears in one column but not the other) don't seem to get carried through.
Any help appreciated.
For Spark 2.4+, you can do that using DataFrame API and built-in array functions.
First, get all the words for each row using the array_union function. Then use the transform function on the words array: for each element, calculate the number of occurrences in each column using the size and array_remove functions:
from pyspark.sql.functions import array_union, expr

df = spark.createDataFrame([(["a", "b", "b", "c"], ["a", "b", "x", "y"])], ["column1", "column2"])

# words: de-duplicated union of both arrays; occ_*: per-word counts in each column
df.withColumn("words", array_union("column1", "column2")) \
  .withColumn("occ_column1",
              expr("transform(words, x -> size(column1) - size(array_remove(column1, x)))")) \
  .withColumn("occ_column2",
              expr("transform(words, x -> size(column2) - size(array_remove(column2, x)))")) \
  .drop("words") \
  .show(truncate=False)
Output:
+------------+------------+---------------+---------------+
|column1     |column2     |occ_column1    |occ_column2    |
+------------+------------+---------------+---------------+
|[a, b, b, c]|[a, b, x, y]|[1, 2, 1, 0, 0]|[1, 1, 0, 1, 1]|
+------------+------------+---------------+---------------+
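To check the logic outside Spark, the same row-wise counting can be sketched in plain Python. This is only an illustration of the idea (ordered union, then per-column counts), not the Spark implementation; rowwise_counts is a hypothetical helper name.

```python
from collections import Counter

def rowwise_counts(column1, column2):
    """Return the ordered union of both lists plus per-column occurrence counts."""
    # Ordered, de-duplicated union, mirroring what array_union produces.
    words = list(dict.fromkeys(column1 + column2))
    c1, c2 = Counter(column1), Counter(column2)
    # A Counter returns 0 for missing keys, so absent words are kept as zeros.
    return words, [c1[w] for w in words], [c2[w] for w in words]

words, occ1, occ2 = rowwise_counts(["a", "b", "b", "c"], ["a", "b", "x", "y"])
print(words)  # ['a', 'b', 'c', 'x', 'y']
print(occ1)   # [1, 2, 1, 0, 0]
print(occ2)   # [1, 1, 0, 1, 1]
```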

Simple Dask Frequency Count

I want to do a frequency count. Imagine this list of people and their age:
In [110]: b = db.from_sequence([('alex', 31), ('cassee', 31), ('Wes', 25), ('Allison', 35)])
In [111]: b.map(lambda x: (x[1], 1))\
.foldby(lambda x: x[0], lambda total,x: total[1]+x[1]).compute()
Out[111]: [(31, 2), (25, (25, 1)), (35, (35, 1))]
The first tuple looks right: (31, 2) means age 31 occurred twice. However, the next two tuples are malformed. I want the output to be a plain frequency count: [(31, 2), (25, 1), (35, 1)]
The invocation you want is as follows:
b.pluck(1).frequencies().compute()
The pluck does the job of selecting the "age" from each element. frequencies does what the name suggests :)
You could have done this in other ways too:
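b.foldby(1, lambda total, _: total + 1, 0, lambda x, y: x + y, 0).compute()
meaning, use element 1 for grouping, and within each group add 1 to the running total for each element, starting at 0. The last two arguments sum the per-partition totals; by default Dask reuses the first function to combine partitions, which would add 1 per partition rather than summing the partial counts. Or: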
from operator import add
from collections import Counter
b.fold(lambda x, y: x + Counter([y[1]]), add, initial=Counter()).compute()
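which is rather complicated to explain: the first function adds each element's age to a per-partition Counter, and add then merges the per-partition Counters into one.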
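The two-stage idea behind these folds (reduce within each partition, then combine the partial results) can be sketched without Dask. This is a plain-Python illustration, not Dask's implementation; foldby_sketch and the partition split are assumptions made up for the example.

```python
from collections import Counter

people = [('alex', 31), ('cassee', 31), ('Wes', 25), ('Allison', 35)]

def foldby_sketch(partitions, key, binop, initial, combine):
    """Fold each partition per key, then combine the per-partition totals."""
    merged = {}
    for part in partitions:
        partial = {}
        for item in part:
            k = key(item)
            partial[k] = binop(partial.get(k, initial), item)
        for k, total in partial.items():
            merged[k] = combine(merged[k], total) if k in merged else total
    return sorted(merged.items())

# Two hypothetical partitions; both 31-year-olds land in the first one.
partitions = [people[:2], people[2:]]
counts = foldby_sketch(
    partitions,
    key=lambda person: person[1],
    binop=lambda total, _: total + 1,   # count one per element
    initial=0,
    combine=lambda x, y: x + y,         # sum the partial counts
)
print(counts)  # [(25, 1), (31, 2), (35, 1)]
```

The result should not depend on how the data is split, which is why the combine step sums the partial counts instead of counting partitions.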

Ruby Intro to Parallel Assignments

a = [1, 2, 3, 4]
b, c = 99, *a → b == 99, c == 1
b, *c = 99, *a → b == 99, c == [1, 2, 3, 4]
Can someone please thoroughly explain why the asterisk makes this code return what it does in Ruby? I understand that if an lvalue has an asterisk, the remaining rvalues are assigned to it. However, why does '*a' make 'c' receive only the value 1 from the array, and why do '*a' and '*c' seem to cancel each other out?
In both cases, 99, *a on the right-hand side expands into the array [99, 1, 2, 3, 4].
In
b, c = 99, *a
b and c become the first two values of the array, with the rest of the array discarded.
In
b, *c = 99, *a
b becomes the first value from the array and c is assigned the rest (because of the splat on the left-hand side).
The 99, *a on the right-hand side is an example of where the square brackets around an array are optional in an assignment.
A simpler example:
a = 1, 2, 3 → a == [1, 2, 3]
Or a more explicit version of your example:
example = [99, *a] → example == [99, 1, 2, 3, 4]
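For readers who know Python, the same two steps (expand the right-hand side, then destructure it) can be sketched there. This is only an analogy: unlike Ruby, Python raises an error when there are more values than targets, so the discarded values need an explicit *_ target.

```python
a = [1, 2, 3, 4]

# The right-hand side expands to one flat list, as in Ruby.
rhs = [99, *a]

# Ruby's `b, c = 99, *a` silently drops the extra values;
# Python needs a throwaway starred target to do the same.
b, c, *_ = rhs

# Same as Ruby's `b, *c = 99, *a`: the starred target collects the rest.
b2, *c2 = rhs

print(rhs)     # [99, 1, 2, 3, 4]
print(b, c)    # 99 1
print(b2, c2)  # 99 [1, 2, 3, 4]
```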

Lua: Piece of code to a specific circumstance

I'm making a (poor) cryptography script in Lua and for this, I need to make a loop that will return a value for each number in a string, for example:
Input: 15, 18, 1, 20, 15, 18, 15, 5, 21, 1, 18, 15, 21, 16, 1, 4, 15, 18, 5, 9, 4, 5, 18, 15, 13, 1
I want to pass each of these numbers to a function that does some math on them, then returns the corresponding letter for each resulting number (15 becomes 'o', 18 becomes 'r', and so on).
In detail, I need a piece of code, to insert into a function, that will:
Return each of the numbers in a string to a function.
After this, the function needs to convert the numbers into letters (as previously said).
Then a new function needs to insert the resulting letters in a new string.
Here's a brief example of how it needs to behave.
Input: 8, 5, 12, 12, 15
Result: 26, 7, 15, 15, 12 (These numbers aren't constant because of a hidden math made inside the function.)
Input: 26, 7, 15, 15, 12
Result: z, g, o, o, l
Input: z, g, o, o, l
Result: "zgool"
I don't think the project's source code is necessary here; I'll just integrate this code into the script's functions. Can someone who understands what I mean help me?
local function my_poor_cryptography(s)
   local codes = {}
   -- string to numbers
   for c in s:gmatch"%a" do
      table.insert(codes, c:byte() - (c:find"%l" and 96 or 64))
   end
   -- math here (https://en.wikipedia.org/wiki/ROT13)
   for j = 1, #codes do
      codes[j] = (codes[j] + 12) % 26 + 1
   end
   -- numbers to string
   s = s:gsub("%a",
      function(c)
         return c.char(table.remove(codes, 1) + (c:find"%l" and 96 or 64))
      end)
   return s
end
Usage:
local str = "Hello, World!"
str = my_poor_cryptography(str)
print(str) --> Uryyb, Jbeyq!
str = my_poor_cryptography(str)
print(str) --> Hello, World!
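The three steps (letters to numbers, math on the numbers, numbers back to letters) can also be sketched in Python for comparison. This is a rough port that assumes ASCII letters and uses ROT13 as a stand-in for the hidden math; numbers_to_letters and rot13_codes are hypothetical helper names.

```python
def numbers_to_letters(numbers):
    """Map 1..26 to 'a'..'z' and join the letters into one string."""
    return "".join(chr(n + 96) for n in numbers)

def rot13_codes(numbers):
    """Stand-in for the hidden math: shift each code 1..26 by 13 places."""
    return [(n + 12) % 26 + 1 for n in numbers]

def my_poor_cryptography(s):
    """Apply the shift to ASCII letters only, keeping case and other characters."""
    out = []
    for ch in s:
        if ch.isascii() and ch.isalpha():
            base = 96 if ch.islower() else 64
            code = (ord(ch) - base + 12) % 26 + 1
            out.append(chr(code + base))
        else:
            out.append(ch)
    return "".join(out)

print(numbers_to_letters([26, 7, 15, 15, 12]))             # zgool
print(numbers_to_letters(rot13_codes([8, 5, 12, 12, 15])))  # uryyb
print(my_poor_cryptography("Hello, World!"))                # Uryyb, Jbeyq!
```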

How to get a list of integers from a given set of possible integers in z3?

Minimal example is the following: Given a set of possible integers [1, 2, 3] create an arbitrary list of size 5 using z3py. Duplicates are allowed.
The expected result is something like [1, 1, 1, 1, 1] or [3, 1, 2, 2, 3], etc.
How to tackle this problem and how to implement 'choosing'? Finally, I would like to find all solutions which can be done by adding additional constraints as explained in link. Any help will be very appreciated.
The following should work:
from z3 import *

def choose(elts, acceptable):
    s = Solver()
    # Each variable must equal one of the acceptable values.
    s.add(And([Or([x == v for v in acceptable]) for x in Ints(elts)]))
    models = []
    while s.check() == sat:
        m = s.model()
        if not m:
            break
        models.append(m)
        # Block this assignment so the next check() yields a different model.
        block = Not(And([v() == m[v] for v in m]))
        s.add(block)
    return models

print(choose('a b c d e', [1, 2, 3]))
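As a sanity check on the number of models, the same lists can be enumerated without a solver. This swaps z3 for itertools.product, so it only applies while there are no extra constraints; choose_all is a hypothetical helper name.

```python
from itertools import product

def choose_all(size, acceptable):
    """Enumerate every list of the given size whose items come from acceptable."""
    return [list(combo) for combo in product(acceptable, repeat=size)]

solutions = choose_all(5, [1, 2, 3])
print(len(solutions))  # 243, i.e. 3 ** 5
print(solutions[0])    # [1, 1, 1, 1, 1]
```

With no further constraints, the z3 loop above should return the same number of models.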
