How to parse a formula in Pyspark? - parsing

I am new to PySpark and have a question.
I have a df like this:
+---+---+-------+
| a1| a2|formula|
+---+---+-------+
| 18| 12| a1+a2|
| 11| 1| a1-a2|
+---+---+-------+
I'm trying to parse the column 'formula' to create a new column with the formula resolved, and obtain a df like this:
+---+---+-------+----------------+
| a1| a2|formula|resolved_formula|
+---+---+-------+----------------+
| 18| 12| a1+a2| 30|
| 11| 1| a1-a2| 10|
+---+---+-------+----------------+
I have tried using
df2 = df.withColumn('resolved_formula', f.expr(df.formula))
df2.show()
but I get this TypeError:
TypeError: Column is not iterable
Can someone help me?
Thank you very much!!

Here's a somewhat complicated way of doing what you intend.
from pyspark.sql import functions as func

data_sdf = data_sdf. \
    withColumn('new_formula', func.col('formula'))

# prefix every column name in the formula with `r.` so that it can be
# evaluated against a Row named `r` inside a lambda
# (this could also be done in a single regex)
for column in data_sdf.columns:
    if column != 'formula':
        data_sdf = data_sdf. \
            withColumn('new_formula', func.regexp_replace('new_formula', column, 'r.'+column))

# use `eval()` to evaluate the operation
data_sdf. \
    rdd. \
    map(lambda r: (r.a1, r.a2, r.formula, eval(r.new_formula))). \
    toDF(['a1', 'a2', 'formula', 'resolved_formula']). \
    show()
# +---+---+-------+----------------+
# | a1| a2|formula|resolved_formula|
# +---+---+-------+----------------+
# | 18| 12| a1+a2| 30|
# | 11| 1| a1-a2| 10|
# +---+---+-------+----------------+
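The answer above relies on `eval()`, which will run whatever Python a formula string contains. As a minimal sketch of an alternative, the formula could be parsed with the standard `ast` module and only plain arithmetic on known column names evaluated. The helper name `resolve_formula` and the set of supported operators are assumptions made for illustration, not part of the original answer, and the example runs on plain dicts rather than a Spark DataFrame.

```python
import ast
import operator

# Binary operators this sketch is willing to evaluate (an assumed, minimal set).
_BIN_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def resolve_formula(formula, values):
    """Evaluate an arithmetic formula such as 'a1+a2' against a dict of column values."""

    def _eval(node):
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](_eval(node.left), _eval(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -_eval(node.operand)
        if isinstance(node, ast.Name):
            # Raises KeyError if the formula names an unknown column.
            return values[node.id]
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError(f"unsupported expression: {ast.dump(node)}")

    return _eval(ast.parse(formula, mode="eval").body)


# The two rows from the question, as plain dicts.
rows = [
    {"a1": 18, "a2": 12, "formula": "a1+a2"},
    {"a1": 11, "a2": 1, "formula": "a1-a2"},
]
resolved = [resolve_formula(r["formula"], r) for r in rows]
print(resolved)
```

In the Spark version, a function like this could presumably replace the `eval(...)` call inside the `map` lambda, with `r.asDict()` passed as the values and the original `formula` column used directly, so the `r.` prefixing step would not be needed.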

Related

'OneHotEncoder' object has no attribute 'transform'

I am using Spark v3.0.0. My dataframe is:
indexer.show()
+------+--------+-----+
|row_id| city|index|
+------+--------+-----+
| 0|New York| 0.0|
| 1| Moscow| 3.0|
| 2| Beijing| 1.0|
| 3|New York| 0.0|
| 4| Paris| 2.0|
| 5| Paris| 2.0|
| 6|New York| 0.0|
| 7| Beijing| 1.0|
+------+--------+-----+
I want to one-hot encode the dataframe's "index" column, but I get this error:
encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
encoder.setDropLast(False)
indexer = encoder.transform(indexer)
----------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-32-70bbd67e6679> in <module>
1 encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
2 encoder.setDropLast(False)
----> 3 indexer = encoder.transform(indexer)
AttributeError: 'OneHotEncoder' object has no attribute 'transform'
You need to fit it first. In Spark 3.0, OneHotEncoder is an estimator, so transform does not exist on it; it exists only on the fitted model that fit() returns:
encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
encoder.setDropLast(False)
ohe = encoder.fit(indexer) # indexer is the existing dataframe, see the question
indexer = ohe.transform(indexer)
See the example in the docs for more details on the usage.

How to make select with variable using Google Sheets

I have a Google sheet with several columns, where support requests from clients are recorded.
A B C
-+---------------+------------------------+-------------
1| Date-1 | John | Ticket-101
2| Date-1 | Anita | Ticket-102
3| Date-2 | John | Ticket-103
4| Date-3 | Dani | Ticket-104
5| Date-3 | Billy | Ticket-105
I want to create two new columns with statistical data about the clients. In these new columns, I want to have the client name and number of opened support tickets.
The end result must be:
A B C D E
-+---------------+------------+-------------+-----------+---------------
1| Date-1 | John | Ticket-101 | John | 2 |
2| Date-1 | Anita | Ticket-102 | Anita | 1 |
3| Date-2 | John | Ticket-103 | Dani | 1 |
4| Date-3 | Dani | Ticket-104 | Billy | 1 |
5| Date-3 | Billy | Ticket-105 |
I created the D column in this way:
=UNIQUE(QUERY(B1:B))
To count how many times a client has contacted us, I use:
=COUNTA(IFERROR(QUERY(B1:B, "select B where B='John'", 0)))
Of course, this is a very poor solution, because for every new client I must create a new formula with
....where B='Client name'".....
Is it possible to write the formula so that the client name is filled in automatically? I imagine something like this:
=ARRAYFORMULA(COUNTA(IFERROR(QUERY(B1:B, "select B where B='value-of-D'", 0))))
=QUERY(B:C,
"select B,count(C)
where B!=''
group by B
label count(C)''", 0)
You can also order the result by count:
=QUERY(B:C,
"select B,count(C)
where B!=''
group by B
order by count(C) desc
label count(C)''", 0)
Try
=query(B:C, "Select B, count(C) where B<>'' group by B label B 'Name', count(C) 'Count'", 1)
and see if that works?
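For readers who want to see what the `group by` queries above compute, here is a minimal Python sketch of the same aggregation on the example data. It is only an illustration of the logic, not something Google Sheets runs; the variable names are made up for the example.

```python
from collections import Counter

# Column B of the example sheet: one client name per ticket.
clients = ["John", "Anita", "John", "Dani", "Billy"]

# Count tickets per client, skipping blank names (the role of `where B != ''`).
ticket_counts = Counter(name for name in clients if name != "")

# most_common() sorts by count, highest first (the role of `order by count(C) desc`).
for name, count in ticket_counts.most_common():
    print(name, count)
```

Each distinct name appears once with its ticket count, which is what columns D and E in the desired result contain.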

date comparison google sheet query

I am using Google Sheets.
I have a data table like:
x|A | B | C
1|date |randomNumber|
2| 20.02.2018 | 1243 |
3| 18.01.2018 | 2 |
4| 17.01.2018 | 1 |
and an overview table:
x|A | B | C
1|date |randomNumber|
2| 20.02.2018 | |
3| 17.01.2018 | |
I want to look up each date from the overview table in my data table and return its value. Not every date in the data table has to appear in the overview table. All columns are date-formatted.
My approach so far was:
=QUERY(Data ;"select B where A = date '"&TEXT(A2;"yyyy-mm-dd")&"'")
but I get an empty output, which should not be the case: I should get 1243.
Thanks already :)
=ARRAYFORMULA(VLOOKUP(A2:A3;DATA!A1:B4;2;0))
With QUERY (copied down from B2 to suit):
=QUERY(Data!A:C;"select B where A = date '"&TEXT(A2;"yyyy-mm-dd")&"'";0)
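The `TEXT(A2;"yyyy-mm-dd")` part of the formula exists because the query language expects date literals in ISO form, `date 'yyyy-mm-dd'`. As a rough sketch of that string-building step, here is the same conversion in Python. The function name `build_date_query` is hypothetical, and the `dd.mm.yyyy` input format is an assumption based on how the dates are displayed in the question.

```python
from datetime import datetime


def build_date_query(cell_text):
    """Build the query string for one overview date shown as dd.mm.yyyy."""
    # Parse the displayed date (assumed format: day.month.year).
    day = datetime.strptime(cell_text, "%d.%m.%Y").date()
    # The query language wants the literal as date 'yyyy-mm-dd'.
    return f"select B where A = date '{day.isoformat()}'"


query = build_date_query("20.02.2018")
print(query)
```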

Counting number of occurrences in a column, and eliminating repeats based on another column

I'm trying to take what's essentially a sign-in sheet for students being tutored, and then list, for each course, how many visits and how many different students visited seeking help. It seems kinda complicated to me so hopefully I can explain it well enough.
In sheetA I have data as follows:
| A | B | C | D | E |
-+------------+---------+-----+-----+---------+
1| Name | Date | In | Out | Course |
-+------------+---------+-----+-----+---------+
2| Ann |##/##/## | # | # | MA101 |
3| Bob |##/##/## | # | # | MA101 |
4| Jim |##/##/## | # | # | MA101 |
5| Bob |##/##/## | # | # | MA101 |
6| Ann |##/##/## | # | # | MA101 |
7| Bob |##/##/## | # | # | MA101 |
8| Ann |##/##/## | # | # | CS101 |
Then in sheetB the output would be:
| A | B | C |
+-----------+-------+-------+
1| Course | Total | Unique|
+-----------+-------+-------+
2| MA101 | 6 | 3 | #This would be 3 because only 3 unique students came
3| CS101 | 1 | 1 |
So all courses are listed under A, the total visits for that course are in B, and C is the number of unique students that went for that course.
What I have so far:
In sheetB I have the formulas for A and B.
A2: =unique(transpose(split(ArrayFormula(concatenate('sheetA'!E2:E&" "))," ")))
B2: =arrayformula(if(len(A7:A),countif(transpose(split(ArrayFormula(concatenate('sheetA'!E2:E&" "))," ")),A7:A),iferror(1/0)))
In case it helps, I put these formulas, broken up with comments on what I understand each part to do, in this gist.
I'm trying to figure out what to put in C2, and I'm just totally lost.
I'd also welcome a better (e.g. more concise) way to do what I have so far, since those formulas came from another SO post.
You can do this easily with native formulas:
The formulas are:
=UNIQUE(E3:E)
=COUNTIF(E3:E,F2)
=COUNTA(UNIQUE(FILTER(A3:A,E3:E=F2)))
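To make the logic of the three formulas concrete (list the courses, count rows per course, count distinct names per course), here is a small Python sketch using the sample rows from the question. It is an illustration of the computation only; the variable names are invented for the example.

```python
from collections import defaultdict

# (name, course) pairs taken from the sign-in sheet in the question.
visits = [
    ("Ann", "MA101"), ("Bob", "MA101"), ("Jim", "MA101"), ("Bob", "MA101"),
    ("Ann", "MA101"), ("Bob", "MA101"), ("Ann", "CS101"),
]

total = defaultdict(int)     # visits per course, like COUNTIF
students = defaultdict(set)  # distinct names per course, like UNIQUE(FILTER(...))
for name, course in visits:
    total[course] += 1
    students[course].add(name)

# course -> (total visits, unique students)
summary = {course: (total[course], len(students[course])) for course in total}
print(summary)
```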

Outer Join using multiple Google Sheets and the QUERY function

As an example, say I have the following sheets in the same workbook of a Google Doc:
SHEET1 | SHEET2
\ A | B | \ A | B | C | D
1| ID |Lookup | 1| Lookup| Name |Flavor | Color
2| 123 | 4445 | 2| 1234 |Whizzer|Cherry | Red
3| 234 | 4445 | 3| 4445 |Fizzer |Lemon | Yellow
4| 124 | 1234 | 4| 9887 |Sizzle |Lime | Blue
5| 767 | 1234 |
6| 555 | 9887 |
Obviously, Google Docs isn't made with relational databases in mind, but I am trying to obtain results similar to the SQL query
SELECT
SHEET1.ID,
SHEET2.*
FROM
SHEET1
LEFT JOIN
SHEET2
ON SHEET1.Lookup = SHEET2.Lookup
resulting in a table that looks like
SHEET3
\ A | B | C | D | E
1| ID |Lookup | Name |Flavor | Color
2| 123 | 4445 |Fizzer |Lemon | Yellow
3| 234 | 4445 |Fizzer |Lemon | Yellow
4| 124 | 1234 |Whizzer|Cherry | Red
5| 767 | 1234 |Whizzer|Cherry | Red
6| 555 | 9887 |Sizzle |Lime | Blue
but this is where I stand currently
SHEET3
\ A | B | C | D | E
1| | | | |
2| 123 | 4445 | #N/A | |
3| 234 | 4445 | | |
4| 124 | 1234 | | |
5| 767 | 1234 | | |
6| 555 | 9887 | | |
At the moment I have managed to use the QUERY function to grab the values from SHEET1 and have tried a few different QUERY functions in SHEET3!C1 in an attempt to "LEFT JOIN" the two sheets using this blog post as a reference. At this point, the two functions I am using are as follows.
SHEET3!A2=QUERY(SHEET1!A2:B20, "SELECT A,B")
SHEET3!C2=QUERY(SHEET2!A2:E20, "SELECT B,C,D WHERE A="""&B2&"""")
and hovering over the error in C2 reads "Query completed with an empty output". How can I join these sheets?
Additional references:
Google Docs syntax page for QUERY
Do this in Sheet3.
In cell A1, to get the correct headings:
={Sheet1!A1:B1,Sheet2!B1:D1}
In cell A2, to get the table of Joined data, try this formula:
=FILTER({Sheet1!A2:B,
VLOOKUP(Sheet1!B2:B, {Sheet2!A2:A, Sheet2!B2:D}, {2,3,4}, false)},
Sheet1!B2:B<>"")
I've written a comprehensive guide about this topic called:
'Mastering Join-formulas in Google Sheets'
If you copy SHEET1 into SHEET3 (A1) then in C2:
=vlookup($B2,Sheet2!$A:$D,column()-1,0)
copied across and down should give the results you show once you have added three column labels.
Sheet3!A2:
=ARRAYFORMULA(Sheet1!A2:B20)
Sheet3!C2:
=ARRAYFORMULA(VLOOKUP(B2:B20,Sheet2!A1:D50,{2,3,4},0))
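The VLOOKUP-based answers above all amount to a left join keyed on the Lookup column. A minimal Python sketch of that join on the example data may make the idea easier to follow. It illustrates the logic only, with variable names chosen for the example, and pads unmatched rows with `None` where Sheets would show #N/A.

```python
# Rows of Sheet1 as (ID, Lookup) and Sheet2 as (Lookup, Name, Flavor, Color).
sheet1 = [(123, 4445), (234, 4445), (124, 1234), (767, 1234), (555, 9887)]
sheet2 = [
    (1234, "Whizzer", "Cherry", "Red"),
    (4445, "Fizzer", "Lemon", "Yellow"),
    (9887, "Sizzle", "Lime", "Blue"),
]

# Index Sheet2 by its Lookup column, which is what VLOOKUP searches.
by_lookup = {row[0]: row[1:] for row in sheet2}

# Left join: keep every Sheet1 row and append the matching Sheet2 columns,
# or three empty values when there is no match.
joined = [
    (row_id, lookup) + by_lookup.get(lookup, (None, None, None))
    for row_id, lookup in sheet1
]

for row in joined:
    print(row)
```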
The following Add-on will provide all you need: Formulas by Top Contributors, using both the built-in SQL join types and the MATRIX formula:
=MATRIX(SQLINNERJOIN(Sheet1!A1:B6,2,Sheet2!A1:D4,1, TRUE),,"3")
It will yield the following outcome:
I've created an example file for you: SQL JOINS
Affiliation: as a Google Top Contributor I helped create the Add-on
