Mapping timeseries + static information into an ML model (XGBoost)

So let's say I have multiple problems, where one problem has two input DataFrames:
Input:
One constant stream of data (e.g. from a sensor). In a second step: multiple streams from multiple sensors.
> df_prob1_stream1
timestamp                  | ident | measure1 | measure2 | total_amount
---------------------------+-------+----------+----------+--------------
2019-09-16 20:00:10.053174 | A     | 0.380    | 0.08     | 2952618
2019-09-16 20:00:00.080592 | A     | 0.300    | 0.11     | 2982228
... (1 million more rows - until a pre-defined ts) ...
One static DataFrame of information, mapped to a unique identifier called ident, which needs to be joined on the ident column in each df_probX_streamX so that the system can recognize that this data is related.
> df_global
ident | some1 | some2 | some3
------+-------+-------+------
A     | LARGE | 8137  | 1
B     | SMALL | 1234  | 2
Output:
A binary classifier [0,1]
So how can I suitably train XGBoost to make the best use of one timeseries DataFrame in combination with one static DataFrame (containing additional context information) in one problem? Any help would be appreciated.
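One common approach is sketched below (purely illustrative: it assumes the scikit-learn style xgboost API, the column names from the tables above, and a per-ident 0/1 label y that is not shown in the question): aggregate each ident's timeseries into summary features, join the static table on ident, and fit the classifier on the result.
import pandas as pd
from xgboost import XGBClassifier
# 1. Collapse the timeseries into per-ident summary features.
ts_features = (
    df_prob1_stream1
    .groupby("ident")[["measure1", "measure2", "total_amount"]]
    .agg(["mean", "std", "min", "max"])
)
ts_features.columns = ["_".join(col) for col in ts_features.columns]
# 2. Join the static context information on ident.
X = ts_features.join(df_global.set_index("ident"), how="left")
# 3. Encode categorical context columns (e.g. some1) numerically.
X = pd.get_dummies(X, columns=["some1"])
# 4. Fit the binary classifier; y is the 0/1 label per ident (assumed given).
model = XGBClassifier(n_estimators=200, max_depth=4)
model.fit(X, y)
This flattens the timeseries per ident; rolling-window or lag features are a common refinement when the temporal dynamics carry most of the signal.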

Related

SPSS: How would I create a column summing up means / medians / range from Compare Means function?

I'm trying to sum up across a row for different numerical variables that have been processed through the Compare Means function.
Below (without the last 'Total' column) is what I have generated from Compare Means; I'm looking to generate that last Total column.
+--------+-------+-------+-------+-------+
| | Var 1 | Var 2 | Var 3 | Total |
+--------+-------+-------+-------+-------+
| Mean | 10 | 1 | 2 | |
| Median | 4 | 20 | 4 | |
| Range | 6 | 40 | 1 | |
| Std.dev| 3 | 3 | 3 | |
+--------+-------+-------+-------+-------+
Here's the syntax of my command:
MEANS TABLES=VAR_1 VAR_2 VAR_3
/CELLS=MEAN STDDEV MEDIAN RANGE.
Can't really imagine what the use is for summing these values, but forget about why - this is how:
The OMS command takes results from the output and puts them in a new dataset which you can then further analyse, as you requested.
DATASET DECLARE MyResults.
OMS /SELECT TABLES /IF COMMANDS=['Means'] SUBTYPES=['Report'] /DESTINATION FORMAT=SAV OUTFILE='MyResults' .
* now your original code.
MEANS TABLES=VAR_1 VAR_2 VAR_3 /CELLS=MEAN STDDEV MEDIAN RANGE.
* now your results are captured - we'll go see them.
OMSEND.
DATASET ACTIVATE MyResults.
* The results are now in a new dataset, which you can analyse.
COMPUTE total=SUM(VAR_1, VAR_2, VAR_3).
EXECUTE.
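To make explicit what the Total column ends up holding, here is a tiny illustration (in Python rather than SPSS, using the numbers from the table above): the SUM runs across the three variables within each statistic row of the captured dataset.
# Sum across Var 1..Var 3 for each statistic row shown in the question.
stats = {
    "Mean":    {"Var 1": 10, "Var 2": 1,  "Var 3": 2},
    "Median":  {"Var 1": 4,  "Var 2": 20, "Var 3": 4},
    "Range":   {"Var 1": 6,  "Var 2": 40, "Var 3": 1},
    "Std.dev": {"Var 1": 3,  "Var 2": 3,  "Var 3": 3},
}
for name, row in stats.items():
    print(name, sum(row.values()))
# Mean 13, Median 28, Range 47, Std.dev 9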

How to forecast (or any other function) in Google Sheets with only one cell of data?

My sheet:
+---------+-----------+---------+---------+-----------+
| product | value 1 | value 2 | value 3 | value 4 |
+---------+-----------+---------+---------+-----------+
| name 1 | 700,000 | 500 | 10,000 | 2,000,000 |
+---------+-----------+---------+---------+-----------+
| name 2 | 200,000 | 800 | 20,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 3 | 100,000 | 150 | 6,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 4 | 1,000,000 | 1,000 | 25,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 5 | 2,000,000 | 1,500 | 30,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 6 | 2,500,000 | 3,000 | 65,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 7 | 300,000 | 300 | 12,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 8 | 350,000 | 200 | 9,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 9 | 900,000 | 1,200 | 28,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 10 | 150,000 | 100 | 5,000 | ? |
+---------+-----------+---------+---------+-----------+
What I am attempting is to predict the empty cells based on the data that I do have. Maybe the prediction should use all of the columns that contain data in every row, or maybe I should be focusing on only one column that contains data in every row?
I have used FORECAST previously, but then I had more data in the column I was predicting values for; the lack of data is, I think, my root problem(?). Not sure if FORECAST is best for this, so any recommendations for other functions are most welcome.
The last thing I can add, though, is that the known value in column E (value 4) is a number I am confident in, and ideally it is used in any formula that I end up with (although I am open to any other recommendations).
The formula I was using:
=FORECAST(D3,E2,$D$2:$D$11)
I don't think this is possible without more information. If you think about it, value 4 could be a constant (always 2,000,000), depend on only one other value (say 200 times value 3), or follow a more complex formula (say the sum of values 1, 2, and 3 plus a constant). Each of these 3 models agrees with the values for name 1, yet they generate vastly different value 4 predictions.
In the case of name 2, the models would output the following for value 4:
Constant: 2,000,000
200 times value 3: 200 × 20,000 = 4,000,000
Sum plus constant: 200,000 + 800 + 20,000 + 1,289,500 = 1,510,300 (where 1,289,500 is the constant needed to fit name 1)
Each of those values could be valid; there is no way to choose between them without further constraints (either more data points or a specified kind of model, but probably both).
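To see this underdetermination concretely, here is a minimal Python sketch (purely illustrative, hard-coding the three toy models described above and the numbers from the table) that calibrates each model on name 1 alone and then predicts value 4 for name 2.
# Three toy models that all reproduce name 1's value 4 exactly,
# yet disagree on name 2. Values come from the table above.
name1 = {"v1": 700_000, "v2": 500, "v3": 10_000, "v4": 2_000_000}
name2 = {"v1": 200_000, "v2": 800, "v3": 20_000}
def constant_model(row):
    return name1["v4"]                              # always 2,000,000
def ratio_model(row):
    return (name1["v4"] / name1["v3"]) * row["v3"]  # 200 times value 3
offset = name1["v4"] - (name1["v1"] + name1["v2"] + name1["v3"])  # 1,289,500
def sum_model(row):
    return row["v1"] + row["v2"] + row["v3"] + offset
for label, model in [("constant", constant_model), ("ratio", ratio_model), ("sum", sum_model)]:
    print(label, model(name2))
# prints: constant 2000000, ratio 4000000.0, sum 1510300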

Find (and return) range of cells containing formulae

I am working on a table where I track a Meter value on a daily basis.
Since it's not really updated daily, I use a formula to forecast the Meter value.
Therefore the forecasted trend between the (variable) days of a manual input is not 100% reliable, so I made a formula where the m³ value of a row was calculated only if NOT ISFORMULA(B#) (see below), in order to read the real usage between those days in the m³ column.
BUT now I have the idea of auto-populating the m³ cells in between with an auto-calculated daily average between the days where two values have been inserted manually.
So I came up with this draft version for column C:
for C11 =if(isformula(B11),if(isblank(C12),,if(isformula(B12),C12,"RANGE_DIFF/DAYS")),RANGE_DIFF)
where, in this case, RANGE_DIFF should be B12-B9 and DAYS = 3.
In other words, how do I determine and return the range of cells between two manually inserted values?
Link to sheet
Thanks in advance :)
+----------+-----------+----------+
| Date | Meter | m³ |
+----------+-----------+----------+
| ... | ... | ... |
| 09/1 | 9,381.296 | 0.75 m³ | <<< MANUAL INSERTION | =IF(ISFORMULA(C9),,C9-C8)
| 10/1 | 9,382.622 | | <<< =TREND(B3:B9,A3:A9,A10) | * =C11
| 11/1 | 9,383.955 | | <<< =TREND(B4:B10,A4:A10,A11) | * =RANGE_DIFF/DAYS
| 12/1 | 9,385.197 | ??? | <<< MANUAL INSERTION | *** RANGE_DIFF >>> =B12-B9
| 13/1 | 9,386.350 | 1.15 m³ | <<< MANUAL INSERTION | =IF(ISFORMULA(C13),,C13-C12)
| ... | ... | ... |
+----------+-----------+----------+
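For reference, the intended RANGE_DIFF/DAYS logic can be sketched outside of Sheets (a small Python illustration using the 09/1 and 12/1 readings from the table above); the Sheets formula itself would still need to locate the two manual entries around the current row.
# Between two manually entered meter readings, fill each intervening day's
# m3 usage with the daily average (RANGE_DIFF / DAYS in the question's terms).
manual = {"09/1": 9381.296, "12/1": 9385.197}  # manual readings from the table
range_diff = manual["12/1"] - manual["09/1"]   # = B12 - B9 = 3.901
days = 3                                       # 10/1, 11/1 and 12/1
daily_usage = range_diff / days                # about 1.300 m3 per day
for day in ("10/1", "11/1", "12/1"):
    print(f"{day}: {daily_usage:.3f} m3")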

How to set more conditions (targets) in the Time Series Node in SPSS Modeler?

Could you please advise whether it is possible to calculate predictions in SPSS Modeler when there are two conditions for the model,
i.e. we need to calculate the future values for the respective ID and at the same time we need to see the split per Var1.
So far we have used the Time Series node, but there we set just one target, Value (currency1). Would it be possible to have the output in a format where we get one figure as the prediction for the respective ID, with Var1 also reflected as a split? We need this split per Var1 because one ID has several values in Var1; this is not the case with Var3, where just one value is assigned to the ID.
ID | Value (currency1) | Value (currency2) | Period   | Var1 | Var2 | Var3
---+-------------------+-------------------+----------+------+------+-----
U1 | 1000              | 1200              | 1/1/2000 | 100  | abc  | 1p1
U1 | 500               | 600               | 2/1/2000 | 100  | abc  | 1p1
U1 | 700               | 840               | 3/1/2000 | 200  | def  | 1p1
U2 | 500               | 600               | 1/1/2000 | 100  | ghj  | 1p2
U2 | 800               | 960               | 4/1/2000 | 300  | abc  | 1p2
Thank you very much in advance for any help / advice.

Neo4j CSV import query super slow, when setting relationships

I am trying to evaluate Neo4j (using the community version).
I am importing some data (1 million rows) using the LOAD CSV process. It needs to match previously imported nodes to create a relationship between them.
Here is my query:
//Query #3
//create edges between Tr and Ad nodes
USING PERIODIC COMMIT
LOAD CSV WITH HEADERS FROM 'file:///1M.txt'
AS line
FIELDTERMINATOR '\t'
//find appropriate tx and ad
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
//create the edge (relationship)
CREATE (tx)-[out:OUT_TO]->(ad)
//set properties on the edge
SET out.id= TOINT(line.id)
SET out.n = TOINT(line.n)
SET out.v = TOINT(line.v)
I have indices on:
Indexes
ON :Ad(p58) ONLINE (for uniqueness constraint)
ON :Tr(txid) ONLINE
ON :Tr(h) ONLINE (for uniqueness constraint)
This query has been running for 5 days now and it has so far created 270K relationships (out of 1M).
The Java heap is 4 GB.
The machine has 32 GB of RAM and an SSD for a drive, and is only running Linux and Neo4j.
Any hints to speed this process up would be highly appreciated.
Should I try the enterprise edition?
Query Plan:
+--------------------------------------------+
| No data returned, and nothing was changed. |
+--------------------------------------------+
If a part of a query contains multiple disconnected patterns,
this will build a cartesian product between all those parts.
This may produce a large amount of data and slow down query processing.
While occasionally intended,
it may often be possible to reformulate the query that avoids the use of this cross product,
perhaps by adding a relationship between the different parts or by using OPTIONAL MATCH (identifier is: (ad))
20 ms
Compiler CYPHER 3.0
Planner COST
Runtime INTERPRETED
+---------------------------------+----------------+---------------------+----------------------------+
| Operator | Estimated Rows | Variables | Other |
+---------------------------------+----------------+---------------------+----------------------------+
| +ProduceResults | 1 | | |
| | +----------------+---------------------+----------------------------+
| +EmptyResult | | | |
| | +----------------+---------------------+----------------------------+
| +Apply | 1 | line -- ad, out, tx | |
| |\ +----------------+---------------------+----------------------------+
| | +SetRelationshipProperty(4) | 1 | ad, out, tx | |
| | | +----------------+---------------------+----------------------------+
| | +CreateRelationship | 1 | out -- ad, tx | |
| | | +----------------+---------------------+----------------------------+
| | +ValueHashJoin | 1 | ad -- tx | ad.p58; line.p58 |
| | |\ +----------------+---------------------+----------------------------+
| | | +NodeIndexSeek | 1 | tx | :Tr(txid) |
| | | +----------------+---------------------+----------------------------+
| | +NodeUniqueIndexSeek(Locking) | 1 | ad | :Ad(p58) |
| | +----------------+---------------------+----------------------------+
| +LoadCSV | 1 | line | |
+---------------------------------+----------------+---------------------+----------------------------+
OKAY, so splitting the MATCH statement into two sped up the query immensely. Thanks @William Lyon for pointing me to the plan. I noticed the warning.
Old MATCH statement:
MATCH (tx:Tr { txid: TOINT(line.txid) }), (ad:Ad {p58: line.p58})
split into two:
MATCH (tx:Tr { txid: TOINT(line.txid) })
MATCH (ad:Ad {p58: line.p58})
On 750K relationships the query took 83 seconds.
Next up: the 9 million row CSV LOAD.
