'OneHotEncoder' object has no attribute 'transform'

I am using Spark v3.0.0. My dataframe is:
indexer.show()
+------+--------+-----+
|row_id| city|index|
+------+--------+-----+
| 0|New York| 0.0|
| 1| Moscow| 3.0|
| 2| Beijing| 1.0|
| 3|New York| 0.0|
| 4| Paris| 2.0|
| 5| Paris| 2.0|
| 6|New York| 0.0|
| 7| Beijing| 1.0|
+------+--------+-----+
Then I want to one-hot encode the dataframe's column "index", but I am getting this error:
encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
encoder.setDropLast(False)
indexer = encoder.transform(indexer)
----------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-32-70bbd67e6679> in <module>
1 encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
2 encoder.setDropLast(False)
----> 3 indexer = encoder.transform(indexer)
AttributeError: 'OneHotEncoder' object has no attribute 'transform'

You need to fit it first; before fitting, the attribute indeed does not exist:
encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
encoder.setDropLast(False)
ohe = encoder.fit(indexer) # indexer is the existing dataframe, see the question
indexer = ohe.transform(indexer)
See the example in the docs for more details on the usage.
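For completeness, a minimal end-to-end sketch of the same approach (assuming the indexer dataframe from the question already exists):
from pyspark.ml.feature import OneHotEncoder

# In Spark 3.x, OneHotEncoder is an estimator: fit() returns a OneHotEncoderModel,
# and it is the model that has transform().
encoder = OneHotEncoder(inputCol="index", outputCol="encoding")
encoder.setDropLast(False)
ohe = encoder.fit(indexer)        # OneHotEncoderModel
indexer = ohe.transform(indexer)
indexer.show()
With dropLast=False, the encoding column holds sparse vectors with one slot per distinct index value, e.g. (4,[0],[1.0]) for New York in this example.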

Related

Mapping timeseries+static information into an ML model (XGBoost)

So let's say I have multiple problems, where one problem has two input DataFrames:
Input:
One constant stream of data (e.g. from a sensor); in a second step, multiple streams from multiple sensors.
> df_prob1_stream1
timestamp                   | ident  | measure1     | measure2 | total_amount |
----------------------------+--------+--------------+----------+--------------+
2019-09-16 20:00:10.053174  | A      | 0.380        | 0.08     | 2952618      |
2019-09-16 20:00:00.080592  | A      | 0.300        | 0.11     | 2982228      |
... (1 million more rows - until a pre-defined ts) ...
One static DataFrame of information, mapped to a unique identifier called ident, which needs to be assigned to the ident column in each df_probX_streamX in order to let the system recognize that this data is related.
> df_global
ident   | some1        | some2    | some3        |
--------+--------------+----------+--------------+
A       | LARGE        | 8137     | 1            |
B       | SMALL        | 1234     | 2            |
Output:
A binary classifier [0,1]
So how can I suitably train XGBoost to make the best use of one timeseries DataFrame in combination with one static DataFrame (containing additional context information) in one problem? Any help would be appreciated.
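One common way to combine such inputs (a rough sketch, not an answer from the thread; the column names and aggregations below are only illustrative) is to collapse each timeseries into fixed-length per-ident features and then join the static table on ident before training XGBoost:
import pandas as pd

# Toy frames mirroring the layout described in the question
df_prob1_stream1 = pd.DataFrame({
    "timestamp": pd.to_datetime(["2019-09-16 20:00:10", "2019-09-16 20:00:00"]),
    "ident": ["A", "A"],
    "measure1": [0.380, 0.300],
    "measure2": [0.08, 0.11],
    "total_amount": [2952618, 2982228],
})
df_global = pd.DataFrame({
    "ident": ["A", "B"],
    "some1": ["LARGE", "SMALL"],
    "some2": [8137, 1234],
    "some3": [1, 2],
})

# Aggregate the stream per ident into summary features (illustrative choices)
agg = (df_prob1_stream1
       .sort_values("timestamp")
       .groupby("ident")
       .agg(measure1_mean=("measure1", "mean"),
            measure2_max=("measure2", "max"),
            total_amount_last=("total_amount", "last"))
       .reset_index())

# Attach the static context columns; `features` (plus labels) can then go to XGBoost
features = agg.merge(df_global, on="ident", how="left")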

Dask read_json metadata mismatch

I'm trying to load json files into a dask df.
import glob
import dask.dataframe as dd
files = glob.glob('**/*.json', recursive=True)
df = dd.read_json(files, lines=False)
There are some missing values in the data, and some of the files have extra columns.
Is there a way to specify a column list, so all possible columns will exist in the concatenated dask df?
Additionally, can't it handle missing values? I get the following error when trying to compute the df:
ValueError: Metadata mismatch found in `from_delayed`.
Partition type: `DataFrame`
+-----------------+-------+----------+
| Column          | Found | Expected |
+-----------------+-------+----------+
| x22             | -     | float64  |
| x21             | -     | object   |
| x20             | -     | float64  |
| x19             | -     | float64  |
| x18             | -     | object   |
| x17             | -     | float64  |
| x16             | -     | object   |
| x15             | -     | object   |
| x14             | -     | object   |
| x13             | -     | object   |
| x12             | -     | object   |
| x11             | -     | object   |
| x10             | -     | object   |
| x9              | -     | float64  |
| x8              | -     | object   |
| x7              | -     | object   |
| x6              | -     | object   |
| x5              | -     | int64    |
| x4              | -     | object   |
| x3              | -     | float64  |
| x2              | -     | object   |
| x1              | -     | object   |
+-----------------+-------+----------+
read_json() is new and tested for the "common" case of homogeneous data. It could, like read_csv, be extended to cope with column selection and data type coercion fairly easily. I note that the pandas function allows the passing of a dtype= parameter.
This is not an answer, but perhaps you would be interested in submitting a PR at the repo? The specific code lives in the file dask.dataframe.io.json.
I bumped into a similar problem and came up with another solution:
import dask.dataframe as dd
import pandas as pd

def read_data(path, **kwargs):
    # empty pandas frame with the schema dask infers from the data
    meta = dd.read_json(path, **kwargs).head(0)
    # edit the meta dataframe here to match what json_engine returns

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # add or drop the necessary columns here
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)
So the idea of this solution is that you do two things:
Edit meta as you see fit (for example, removing a column you don't need).
Wrap the json engine function, dropping/adding the necessary columns so meta matches what this function returns.
Examples:
You have one particular irrelevant column which causes your code to fail with the error:
| Column | Found | Expected |
| x22 | - | object |
In this case you simply drop this column from meta and in your json_engine() wrapper.
You have some relevant columns which are reported missing for some partitions. In this case you get an error similar to the topic starter's.
In this case you add the necessary columns to meta with the necessary types (BTW, meta is just an empty pandas dataframe in this case), and you also add those columns as empty in your json_engine() wrapper if necessary. A sketch of that edit is shown below.
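A minimal sketch for this second case, assuming (as in the error above) the missing column is x22 with dtype float64:
import dask.dataframe as dd
import pandas as pd

def read_data(path, **kwargs):
    # meta inferred from the data; force the missing column to exist in it
    meta = dd.read_json(path, **kwargs).head(0)
    if "x22" not in meta.columns:
        meta["x22"] = pd.Series(dtype="float64")

    def json_engine(*args, **kwargs):
        df = pd.read_json(*args, **kwargs)
        # guarantee every partition has the same column, filled with NaN
        if "x22" not in df.columns:
            df["x22"] = pd.Series(dtype="float64", index=df.index)
        return df

    return dd.read_json(path, meta=meta, engine=json_engine, **kwargs)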
Also look at the proposal in the comments to the https://stackoverflow.com/a/50929229/2727308 answer: to use dask.bag instead.
I added the pandas read_json kwarg dtype as object so all the columns are inferred as objects:
df = dd.read_json(files, dtype=object)

date comparison google sheet query

I am using Google Sheets.
I have a data table like:
x| A          | B            | C
1| date       | randomNumber |
2| 20.02.2018 | 1243         |
3| 18.01.2018 | 2            |
4| 17.01.2018 | 1            |
and an overview table:
x| A          | B            | C
1| date       | randomNumber |
2| 20.02.2018 |              |
3| 17.01.2018 |              |
I want to take the dates in the overview table and look up their value in my data table. Not every date of the data sheet has to appear in the overview table. All columns are date-formatted.
My approach so far was:
=QUERY(Data ;"select B where A = date '"&TEXT(A2;"yyyy-mm-dd")&"'")
but I get an empty output, which should not be the case; I should get 1243.
Thanks already :)
=ARRAYFORMULA(VLOOKUP(A2:A3;DATA!A1:B4;2;0))
With QUERY (copied down from B2 to suit):
=QUERY(Data!A:C;"select B where A = date '"&TEXT(A2;"yyyy-mm-dd")&"'";0)

Concatenating nodes from a query into a single line for export to csv in Neo4J using Cypher

I have a Neo4j graph that represents a chess tournament.
Say I run this:
MATCH (c:ChessMatch {m_id: 1})-[:PLAYED]-(p:Player) RETURN *
This gives me the results of the two players who played in a chess match.
The graph looks like this:
And the properties are something like this:
|--------------|------------------|
| (ChessMatch) |                  |
| m_id         | 1                |
| date         | 1969-05-02       |
| comments     | epic battle      |
|--------------|------------------|
| (player)     |                  |
| p_id         | 1                |
| name         | Roy Lopez        |
|--------------|------------------|
| (player)     |                  |
| p_id         | 2                |
| name         | Aron Nimzowitsch |
|--------------|------------------|
I'd like to export this data to a csv, which would look like this:
| m_id | date       | comments    | p_id_A | name_A    | p_id_B | name_B           |
|------|------------|-------------|--------|-----------|--------|------------------|
| 1    | 1969-05-02 | epic battle | 1      | Roy Lopez | 2      | Aron Nimzowitsch |
Googling around, surprisingly, I didn't find any solid answers. The best I could think of is to just use py2neo, pull down all the data as separate tables, and merge in Pandas, but this seems uninspiring. Any ideas on how to do this in Cypher would be greatly illuminating.
APOC has a procedure for that:
apoc.export.csv.query
Check https://neo4j-contrib.github.io/neo4j-apoc-procedures/index32.html#_export_import for more details. Note that you'll have to add the following to neo4j.conf:
apoc.export.file.enabled=true
Hope this helps.
Regards,
Tom
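If you would rather do the reshaping in Cypher and drive the export from Python (the py2neo/pandas route mentioned in the question), here is a rough sketch using the official neo4j driver; the connection details and the assumption of exactly two players per match are hypothetical:
from neo4j import GraphDatabase
import pandas as pd

# Collect both players of each match into a list, then flatten to one row per match
query = """
MATCH (c:ChessMatch)-[:PLAYED]-(p:Player)
WITH c, collect(p) AS players
RETURN c.m_id AS m_id, c.date AS date, c.comments AS comments,
       players[0].p_id AS p_id_A, players[0].name AS name_A,
       players[1].p_id AS p_id_B, players[1].name AS name_B
"""

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    rows = [record.data() for record in session.run(query)]
driver.close()

pd.DataFrame(rows).to_csv("chess_matches.csv", index=False)
The same RETURN shape can also be handed to apoc.export.csv.query if you prefer to export server-side.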

Counting number of occurrences in a column, and eliminating repeats based on another column

I'm trying to take what's essentially a sign-in sheet for students being tutored, and then list, for each course, how many visits and how many different students visited seeking help. It seems kinda complicated to me so hopefully I can explain it well enough.
In sheetA I have data as follows:
 | A          | B       | C   | D   | E       |
-+------------+---------+-----+-----+---------+
1| Name       | Date    | In  | Out | Course  |
-+------------+---------+-----+-----+---------+
2| Ann        |##/##/## | #   | #   | MA101   |
3| Bob        |##/##/## | #   | #   | MA101   |
4| Jim        |##/##/## | #   | #   | MA101   |
5| Bob        |##/##/## | #   | #   | MA101   |
6| Ann        |##/##/## | #   | #   | MA101   |
7| Bob        |##/##/## | #   | #   | MA101   |
8| Ann        |##/##/## | #   | #   | CS101   |
Then in sheetB the output would be:
 | A         | B     | C     |
 +-----------+-------+-------+
1| Course    | Total | Unique|
 +-----------+-------+-------+
2| MA101     | 6     | 3     |  # This would be 3 because only 3 unique students came
3| CS101     | 1     | 1     |
So all courses are listed under A, the total visits for that course are in B, and C is the number of unique students that went for that course.
What I have so far:
In sheetB I have the formulas for A and B.
A2: =unique(transpose(split(ArrayFormula(concatenate('sheetA'!E2:E&" "))," ")))
B2: =arrayformula(if(len(A7:A),countif(transpose(split(ArrayFormula(concatenate('sheetA'!E2:E&" "))," ")),A7:A),iferror(1/0)))
If it helps, I put these equations in this gist, broken up with comments on what I understand each part to do.
I'm trying to figure out what to put in C2, and I'm just totally lost.
I would also appreciate it if anyone knows a better way to do what I did so far, i.e. something more concise, because those formulas were from another SO post.
You can do this easily with native formulas:
The formulas are:
=UNIQUE(E3:E)
=COUNTIF(E3:E,F2)
=COUNTA(UNIQUE(FILTER(A3:A,E3:E=F2)))
