Cypher / neo4j: sum by value and generate grand total - neo4j

here's my sample DB: http://console.neo4j.org/r/plb1ez
It contains two categories named "Digital Cameras". With the following query I group the nodes by category name and return the sum of significance * view_count for each name:
MATCH (a)
WHERE a.metatype = "Category"
RETURN a.categoryName, SUM(a.significance * a.view_count) AS popularity
However, what I actually need is not the absolute popularity (significance * view_count) but the relative one, so the query additionally needs to return the sum of all popularities (1500 according to my math). Then I can calculate the fraction popularity / grandTotal, which I call "relativePopularity", for each category.
Desired result:
+----------------------+------------+------------+--------------------+
| a.categoryName       | popularity | grandTotal | relativePopularity |
+----------------------+------------+------------+--------------------+
| Digital Compacts     | 300        | 1500       | 0.2                |
| Hand-held Camcorders | 300        | 1500       | 0.2                |
| Digital SLR          | 150        | 1500       | 0.1                |
| Digital Cameras      | 750        | 1500       | 0.5                |
+----------------------+------------+------------+--------------------+
Currently I'm doing this calculation with two asynchronous jobs, but I need it done in one go.
Thanks for your help!
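One way to do this in a single query (a sketch, not run against the console data set) is to aggregate per category first, collect the per-category rows, aggregate again for the grand total, and then unwind:

```cypher
MATCH (a)
WHERE a.metatype = "Category"
// first aggregation: popularity per category name
WITH a.categoryName AS categoryName,
     SUM(a.significance * a.view_count) AS popularity
// second aggregation: keep the rows and sum them into a grand total
WITH COLLECT({name: categoryName, popularity: popularity}) AS rows,
     SUM(popularity) AS grandTotal
UNWIND rows AS row
RETURN row.name AS categoryName,
       row.popularity AS popularity,
       grandTotal,
       row.popularity * 1.0 / grandTotal AS relativePopularity
```

The `* 1.0` forces a float division so relativePopularity is not truncated to an integer.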


Classification with Integers and Types

Let's say we have the following dataset:
Label |          Features
------+---------------------------
 Age  | Size | Weight | shoeSize
  20  |  180 |     80 |       42
  40  |  173 |     56 |       38
As far as I know, features in machine learning should be normalized, and the ones above normalize well. But what if I want to extend the feature list with, for example, the following features:
| Gender | Ethnicity |
| 0 | 1 |
| 1 | 2 |
| 0 | 3 |
| 0 | 2 |
where the Gender values 0 and 1 stand for female and male, and the Ethnicity values 1, 2 and 3 stand for Asian, Hispanic and European. Since these values reference types, I am not sure whether they can be normalized.
If they cannot be normalized, how can I handle mixing values like the size with types like the ethnicity?
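Codes like these should not be min-max scaled as if they were numbers, because that would impose a spurious ordering (e.g. "european" > "asian"). The usual fix is one-hot encoding for multi-class types, while a binary feature like gender can stay as 0/1. A minimal sketch in plain Python (the normalization ranges are made-up assumptions, not from the data):

```python
# One-hot encode categorical features so no false ordering is implied.
# gender: 0=female, 1=male; ethnicity: 1=asian, 2=hispanic, 3=european
rows = [
    {"size": 180, "weight": 80, "shoe": 42, "gender": 0, "ethnicity": 1},
    {"size": 173, "weight": 56, "shoe": 38, "gender": 1, "ethnicity": 2},
]

ETHNICITIES = [1, 2, 3]  # known category values

def encode(row, size_range=(150, 200), weight_range=(40, 120), shoe_range=(35, 48)):
    def scale(v, lo, hi):
        return (v - lo) / (hi - lo)  # min-max normalize a numeric feature

    numeric = [
        scale(row["size"], *size_range),
        scale(row["weight"], *weight_range),
        scale(row["shoe"], *shoe_range),
    ]
    # One indicator column per ethnicity value; exactly one of them is 1.
    onehot = [1.0 if row["ethnicity"] == e else 0.0 for e in ETHNICITIES]
    return numeric + [float(row["gender"])] + onehot

vec = encode(rows[0])  # 3 scaled numerics + gender + 3 ethnicity indicators
```

With this layout every feature lies in [0, 1], so the numeric and categorical columns can be mixed freely in one feature vector.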

How to forecast (or any other function) in Google Sheets with only one cell of data?

My sheet:
+---------+-----------+---------+---------+-----------+
| product | value 1 | value 2 | value 3 | value 4 |
+---------+-----------+---------+---------+-----------+
| name 1 | 700,000 | 500 | 10,000 | 2,000,000 |
+---------+-----------+---------+---------+-----------+
| name 2 | 200,000 | 800 | 20,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 3 | 100,000 | 150 | 6,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 4 | 1,000,000 | 1,000 | 25,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 5 | 2,000,000 | 1,500 | 30,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 6 | 2,500,000 | 3,000 | 65,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 7 | 300,000 | 300 | 12,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 8 | 350,000 | 200 | 9,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 9 | 900,000 | 1,200 | 28,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 10 | 150,000 | 100 | 5,000 | ? |
+---------+-----------+---------+---------+-----------+
What I am attempting is to predict the empty cells based on the data that I do have. Maybe I should base the prediction on just one of the columns that contains data in every row?
I have used FORECAST previously, but then I had more data in the column I was predicting; I think the lack of data is my root problem. I'm not sure FORECAST is the best fit here, so recommendations for other functions are most welcome.
The last thing I can add is that the known value in column E (value 4) is a confident number, and ideally it's used in whatever formula I end up with (although I am open to other recommendations).
The formula I was using:
=FORECAST(D3,E2,$D$2:$D$11)
I don't think this is possible without more information. If you think about it, value 4 could be a constant (always 2,000,000), depend on only one other value (say 200 times value 3), or follow a more complex formula (say the sum of values 1, 2, and 3 plus a constant). Each of these 3 models agrees with the values for name 1, yet they generate vastly different value 4 predictions.
In the case of name 2, the models would output the following for value 4:
Constant: 2,000,000
Value 3: 4,000,000 (200 × 20,000)
Sum: 1,510,300 (200,000 + 800 + 20,000 plus a fitted constant of 1,289,500)
Each of those values could be valid without providing further constraints (either through data points or specifying the kind of model, but probably both).
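The point about under-determination can be checked with a quick calculation; every constant below is fitted to the single known row for name 1, which is exactly why all three models fit it perfectly and still disagree on name 2:

```python
# Three models that all fit name 1 exactly but disagree on name 2.
name1 = {"v1": 700_000, "v2": 500, "v3": 10_000, "v4": 2_000_000}
name2 = {"v1": 200_000, "v2": 800, "v3": 20_000}

# Model A: value 4 is a constant.
const_pred = name1["v4"]

# Model B: value 4 is proportional to value 3 (factor fitted on name 1).
factor = name1["v4"] / name1["v3"]              # 200
prop_pred = factor * name2["v3"]                # 4,000,000

# Model C: value 4 = v1 + v2 + v3 + c (c fitted on name 1).
c = name1["v4"] - (name1["v1"] + name1["v2"] + name1["v3"])  # 1,289,500
sum_pred = name2["v1"] + name2["v2"] + name2["v3"] + c       # 1,510,300
```

One observed row is enough to fit any of these models, but not enough to choose between them.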

COUNTIFS with OR but no SUM

I'm trying to count the number of items that fit at least one criterion, but my formula counts 2 instead of 1 when an item fits 2 criteria at the same time.
Consider the following example:
Article | Rate 1 | Rate 2 | Rate 3 | Language
1 | 12% | 54% | 6% | English
2 | 65% | 55% | 34% | English
3 | 59% | 12% | 78% | French
4 | 78% | 8% | 47% | English
5 | 12% | 11% | 35% | English
How do you count the number of articles in English with at least one success rate over 50%?
Right now my formula counts 4 instead of 3, because article 2 counts twice. (I'm on Google Sheets.)
Thank you for your help.
Best,
Assuming that data is in columns A:E, you could use:
=COUNT(filter(A2:A6,E2:E6="English",(D2:D6>=50%)+(C2:C6>=0.5)+(B2:B6>=0.5)))
=SUMPRODUCT(--(E2:E6="english"), SIGN((B2:B6>0.5)+(C2:C6>0.5)+(D2:D6>0.5)))
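The reason the second formula wraps the per-column comparisons in SIGN is that adding the three boolean columns yields 2 for article 2; SIGN clamps any positive hit count to 1, so each row contributes at most once. The same logic, spelled out in plain Python on the example data (illustrative only, not a Sheets formula):

```python
# Rows: (language, rate1, rate2, rate3) from the example table.
rows = [
    ("English", 0.12, 0.54, 0.06),
    ("English", 0.65, 0.55, 0.34),
    ("French",  0.59, 0.12, 0.78),
    ("English", 0.78, 0.08, 0.47),
    ("English", 0.12, 0.11, 0.35),
]

# Naive OR-as-sum: one count per passing criterion, so article 2
# (two rates over 50%) is counted twice -> 4.
naive = sum(
    (r > 0.5)
    for lang, *rates in rows if lang == "English"
    for r in rates
)

# Clamped per row (the SIGN / FILTER approach): each article counts
# at most once -> 3.
correct = sum(
    1
    for lang, *rates in rows
    if lang == "English" and any(r > 0.5 for r in rates)
)
```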

Description matching in record linkage using Machine learning Approach

We are working on a record linkage project.
In simple terms, we search for a product in a database just by looking at the similarity of descriptions. It is a very interesting problem to solve, but the machine learning approach we have adopted currently yields very low accuracy. If you can suggest a more lateral approach, it will help our project a lot.
Input description
+----+-----------------------------------------------+
| ID | description                                   |
+----+-----------------------------------------------+
| 1  | delta t17267-ss ara 17 series shower trim ss  |
| 2  | delta t14438 chrome lahara tub shower trim on |
| 3  | delta t14459 trinsic tub/shower trim          |
| 4  | delta t17497 cp cassidy tub/shower trim only  |
| 5  | delta t14497-rblhp cassidy tub & shower trim  |
| 6  | delta t17497-ss cassidy 17 series tub/shower  |
+----+-----------------------------------------------+
Description in Database
+----+------------------------------------------------------------------------------------------------+
| ID | description                                                                                    |
+----+------------------------------------------------------------------------------------------------+
| 1  | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial                     |
| 2  | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential          |
| 3  | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential         |
| 4  | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential |
| 5  | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze                            |
| 6  | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential    |
+----+------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to each other, which causes a huge issue.
2. There are around 2 million records in the database, but when we search for a specific manufacturer the search space is reduced to a few hundred.
3. The record with ID 1 in "Input description" is the same product as the record with ID 1 in "Description in Database" (we know that from a manual check).
4. We use a random forest for training and prediction.
Current approach
1. We tokenize the descriptions.
2. We remove stopwords.
3. We add abbreviation information.
4. For each record pair we calculate scores from different string metrics (Jaccard, Sørensen-Dice, cosine) and also average these scores.
5. We then calculate a score for the manufacturer ID using the Jaro-Winkler metric.
6. If there are 5 records of a manufacturer in "Input description" and 10 records of that manufacturer in the database, there are 50 record pairs in total, i.e. 10 pairs per input record, and their scores end up very close together. We keep the top 4 pairs from each set of 10; when more than one pair shares the same score, we keep all of them.
7. We arrive at the following training data set format (index:value feature columns):
+---------+-------------------+----------------------+---------------+--------------+-----------+
| IsMatch | description avg   | manufacturer ID score| Jaccard score | sorensenDice | cosine(3) |
+---------+-------------------+----------------------+---------------+--------------+-----------+
| 1       | 1:0.19            | 2:0.88               | 3:0.12        | 4:0.21       | 5:0.23    |
| 0       | 1:0.14            | 2:0.66               | 3:0.08        | 4:0.16       | 5:0.17    |
| 0       | 1:0.14            | 2:0.68               | 3:0.08        | 4:0.15       | 5:0.19    |
| 0       | 1:0.14            | 2:0.58               | 3:0.08        | 4:0.16       | 5:0.16    |
| 0       | 1:0.12            | 2:0.55               | 3:0.08        | 4:0.14       | 5:0.14    |
+---------+-------------------+----------------------+---------------+--------------+-----------+
We train on the above dataset; when we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach.
We planned to use TF-IDF, but an initial investigation suggests it may not improve the accuracy by much either.
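Before dismissing TF-IDF: character n-gram TF-IDF with cosine similarity is often robust on noisy product descriptions, because it survives token-level mismatches like "t17267-ss" vs "monitor17". A minimal, library-free sketch (the catalog strings are taken from the tables above; in practice scikit-learn's TfidfVectorizer with analyzer='char_wb' would replace this hand-rolled version):

```python
import math
from collections import Counter

def ngrams(text, n=3):
    """Character trigrams of a whitespace-normalized, lowercased string."""
    text = " ".join(text.lower().split())
    return [text[i:i + n] for i in range(max(len(text) - n + 1, 0))]

def tfidf_vectors(docs, n=3):
    """One sparse dict per document: n-gram -> tf * smoothed idf."""
    counts = [Counter(ngrams(d, n)) for d in docs]
    df = Counter(g for c in counts for g in c)   # document frequency
    N = len(docs)
    return [
        {g: tf * (math.log((1 + N) / (1 + df[g])) + 1) for g, tf in c.items()}
        for c in counts
    ]

def cosine(u, v):
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

query = "delta t17267-ss ara 17 series shower trim ss"
catalog = [
    "delta monitor17 ara shower trim 2 gpm 1 lever handle stainless commercial",
    "delta monitor 14 lahara tub and shower trim 2 gpm 1 handle chrome plated residential",
]
vecs = tfidf_vectors([query] + catalog)
scores = [cosine(vecs[0], v) for v in vecs[1:]]
```

These similarity scores could either replace or be appended to the existing string-metric features fed to the random forest.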

How to set more conditions (targets) in the Time Series Node in SPSS Modeler?

Could you please advise whether it is possible to calculate predictions in SPSS Modeler when there are two conditions for the model,
i.e. we need to calculate the future values for each ID and at the same time we need to see the split per Var1?
So far we have used the Time Series node, but there we set just one target value (currency1). Would it be possible to have the output in a format where we get one figure as the prediction per ID, with Var1 also reflected as a split? We need this split per Var1 because one ID has several values of Var1; it is not like Var3, where just one value is assigned to each ID.
ID | Value (currency1) | Value (currency2) | Period | Var1 | Var2 | Var3
---------------------------------------------------------------------------
U1 | 1000 | 1200 | 1/1/2000 | 100 | abc | 1p1
U1 | 500 | 600 | 2/1/2000 | 100 | abc | 1p1
U1 | 700 | 840 | 3/1/2000 | 200 | def | 1p1
U2 | 500 | 600 | 1/1/2000 | 100 | ghj | 1p2
U2 | 800 | 960 | 4/1/2000 | 300 | abc | 1p2
Thank you very much in advance for any help / advice.
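One common workaround (an assumption about your stream, not a tested recipe) is to build a composite key before the Time Series node, e.g. with a Derive node concatenating ID and Var1, and use that key as the series identifier so each (ID, Var1) pair is forecast as its own series. The grouping idea, sketched illustratively in plain Python with the field names from the table above:

```python
from collections import defaultdict

# (ID, Value(currency1), Period, Var1) rows from the example table.
rows = [
    ("U1", 1000, "1/1/2000", 100),
    ("U1",  500, "2/1/2000", 100),
    ("U1",  700, "3/1/2000", 200),
    ("U2",  500, "1/1/2000", 100),
    ("U2",  800, "4/1/2000", 300),
]

# Derive a composite key: each (ID, Var1) pair becomes its own series,
# which is what the Time Series node would then forecast separately.
series = defaultdict(list)
for id_, value, period, var1 in rows:
    series[f"{id_}_{var1}"].append((period, value))
```

After forecasting per composite key, the key can be split back into ID and Var1 to present one prediction per ID with the Var1 breakdown.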
