I'm trying to count the number of items that fit at least one criterion, but my formula counts 2 instead of 1 when an item fits 2 criteria at the same time.
Consider the following example:
| Article | Rate 1 | Rate 2 | Rate 3 | Language |
|:-------:|-------:|-------:|-------:|----------|
| 1 | 12% | 54% | 6% | English |
| 2 | 65% | 55% | 34% | English |
| 3 | 59% | 12% | 78% | French |
| 4 | 78% | 8% | 47% | English |
| 5 | 12% | 11% | 35% | English |
How do you count the number of articles in English with at least one success rate over 50%?
Right now my formula counts 4 instead of 3, because article 2 is counted twice. (I'm on Google Sheets.)
Thank you for your help.
Best,
Assuming that data is in columns A:E, you could use:
=COUNT(FILTER(A2:A6, E2:E6="English", (D2:D6>=0.5)+(C2:C6>=0.5)+(B2:B6>=0.5)))
Or with SUMPRODUCT, using SIGN to clamp each row's match count to 1:
=SUMPRODUCT(--(E2:E6="english"), SIGN((B2:B6>0.5)+(C2:C6>0.5)+(D2:D6>0.5)))
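To see why SIGN fixes the double counting, here is the same logic as a Python sketch over the example data: the per-row match count (0 to 3) is collapsed to 0 or 1 before summing, so article 2 counts once even though two of its rates exceed 50%.

```python
rows = [
    # (article, rate1, rate2, rate3, language)
    (1, 0.12, 0.54, 0.06, "English"),
    (2, 0.65, 0.55, 0.34, "English"),
    (3, 0.59, 0.12, 0.78, "French"),
    (4, 0.78, 0.08, 0.47, "English"),
    (5, 0.12, 0.11, 0.35, "English"),
]

# "at least one rate over 50%" is evaluated once per row, never per rate
count = sum(
    1 for _, r1, r2, r3, lang in rows
    if lang == "English" and (r1 > 0.5 or r2 > 0.5 or r3 > 0.5)
)
print(count)  # 3
```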
My sheet:
+---------+-----------+---------+---------+-----------+
| product | value 1 | value 2 | value 3 | value 4 |
+---------+-----------+---------+---------+-----------+
| name 1 | 700,000 | 500 | 10,000 | 2,000,000 |
+---------+-----------+---------+---------+-----------+
| name 2 | 200,000 | 800 | 20,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 3 | 100,000 | 150 | 6,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 4 | 1,000,000 | 1,000 | 25,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 5 | 2,000,000 | 1,500 | 30,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 6 | 2,500,000 | 3,000 | 65,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 7 | 300,000 | 300 | 12,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 8 | 350,000 | 200 | 9,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 9 | 900,000 | 1,200 | 28,000 | ? |
+---------+-----------+---------+---------+-----------+
| name 10 | 150,000 | 100 | 5,000 | ? |
+---------+-----------+---------+---------+-----------+
What I am attempting is to predict the empty cells in value 4 based on the data that I do have. Maybe I should use just one of the columns that contains data in every row, or focus on only one such column?
I have used FORECAST before, but back then I had more data in the column I was predicting for; I think the lack of data here is my root problem. I'm not sure FORECAST is best for this, so recommendations for other functions are most welcome.
The last thing I can add is that the known value in column E (value 4) is a confident number, and ideally it's used in whatever formula I end up with (although I am open to other recommendations).
The formula I was using:
=FORECAST(D3,E2,$D$2:$D$11)
I don't think this is possible without more information. If you think about it, value 4 could be a constant (always 2,000,000), depend on only one other value (say, 200 times value 3), or be a more complex formula (say, the sum of values 1, 2, and 3 plus a constant). Each of these 3 models agrees with the values for name 1, yet they generate vastly different value 4 predictions.
In the case of name 2, the models would output the following for value 4:
Constant: 2,000,000
200 × value 3: 200 × 20,000 = 4,000,000
Sum plus constant: 220,800 + 1,289,500 = 1,510,300
Each of those values could be valid; to narrow it down you need further constraints, either more data points or a specified kind of model, but probably both.
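To make the ambiguity concrete, here is a Python sketch of three such models. The constant 1,289,500 in the third model is chosen only so that it fits name 1; every number outside the sheet's table is illustrative.

```python
# Three hypothetical models that all reproduce name 1 exactly
# (value1=700,000, value2=500, value3=10,000 -> value4=2,000,000):
def constant(v1, v2, v3):
    return 2_000_000                    # value 4 is always 2,000,000

def scaled_v3(v1, v2, v3):
    return 200 * v3                     # 200 * 10,000 = 2,000,000

def summed(v1, v2, v3):
    return v1 + v2 + v3 + 1_289_500    # 710,500 + 1,289,500 = 2,000,000

name1 = (700_000, 500, 10_000)
name2 = (200_000, 800, 20_000)

# All three fit the one known row...
assert all(m(*name1) == 2_000_000 for m in (constant, scaled_v3, summed))

# ...yet they disagree wildly on name 2:
predictions = [m(*name2) for m in (constant, scaled_v3, summed)]
print(predictions)  # [2000000, 4000000, 1510300]
```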
I know how to do this using a custom function/script, but I am wondering if it can be done with a built-in formula.
I have a list of tasks with a start date and end date. I want to calculate the actual # of working days (NETWORKDAYS) spent on all the tasks.
- Task days may overlap, so I can't just total the # of days spent on each task.
- There may be gaps between tasks, so I can't just find the difference between the first start and last end.
For example, let's use these:
| Task Name | Start Date | End Date | NETWORKDAYS |
|:---------:|------------|------------|:-----------:|
| A | 2019-09-02 | 2019-09-04 | 3 |
| B | 2019-09-03 | 2019-09-09 | 5 |
| C | 2019-09-12 | 2019-09-13 | 2 |
| D | 2019-09-16 | 2019-09-17 | 2 |
| E | 2019-09-19 | 2019-09-23 | 3 |
Here it is visually:
Now:
- If you total the NETWORKDAYS you'll get 15.
- If you calculate NETWORKDAYS between 2019-09-02 and 2019-09-23 you get 16.
- But the actual duration is 13:
  - A and B overlap a bit
  - There is a gap between B and C
  - There is a gap between D and E
If I was to write a custom function I would basically take all the dates, sort them, find overlaps and remove them, and account for gaps.
But I am wondering if there is a way to calculate the actual duration using built-in formulas?
Sure, why not:
=ARRAYFORMULA(COUNTA(IFERROR(QUERY(UNIQUE(TRANSPOSE(SPLIT(CONCATENATE("×"&
SPLIT(REPT(INDIRECT("B1:B"&COUNTA(B1:B))&"×",
NETWORKDAYS(INDIRECT("B1:B"&COUNTA(B1:B)), INDIRECT("C1:C"&COUNTA(B1:B)))), "×")+
TRANSPOSE(ROW(INDIRECT("A1:A"&MAX(NETWORKDAYS(B1:B, C1:C))))-1)), "×"))),
"where Col1>4000", 0))))
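For comparison, the custom-function route the asker describes (enumerate each task's working days, union them to drop overlaps, count the set to ignore gaps) can be sketched in Python. This assumes plain NETWORKDAYS semantics, i.e. only weekends are excluded, no holiday list:

```python
from datetime import date, timedelta

def working_days(start, end):
    """Yield weekdays (Mon-Fri) from start to end inclusive."""
    d = start
    while d <= end:
        if d.weekday() < 5:  # 0=Mon .. 4=Fri
            yield d
        d += timedelta(days=1)

tasks = [  # (start, end) for tasks A-E from the example
    (date(2019, 9, 2), date(2019, 9, 4)),
    (date(2019, 9, 3), date(2019, 9, 9)),
    (date(2019, 9, 12), date(2019, 9, 13)),
    (date(2019, 9, 16), date(2019, 9, 17)),
    (date(2019, 9, 19), date(2019, 9, 23)),
]

# A set union removes overlapping days; counting the set skips gaps.
distinct_days = {d for start, end in tasks for d in working_days(start, end)}
print(len(distinct_days))  # 13
```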
I have a Google sheet with data of different players attacks and their corresponding damage.
Sheet1
| Player | Attack | Damage |
|:------------|:-----------:|------------:|
| Iron Man | Melee | 50 |
| Iron Man | Missile | 2500 |
| Iron Man | Unibeam | 100 |
| Superman | Melee | 9000 |
| Superman | Breath | 200 |
| Superman | Laser | 1500 |
In my second sheet, I want to list each player and display their best attack and the corresponding damage. Like this:
Sheet2
| Player | Best attack | Damage |
|:------------|:-----------:|------------:|
| Iron Man | Missile | 2500 |
| Superman | Melee | 9000 |
I have tried to add the following in the damage column (third column) of Sheet2:
=MAX(IF(Sheet1!A:A=A2;Sheet1!C:C))
But I get 9000 for Superman and 0 for Iron Man. For best attack (second column) I guess MAX should be used together with VLOOKUP, but I don't know how to apply it.
Edit:
=ArrayFormula(MAX(IF(Sheet1!A:A=A3;Sheet1!C:C))) seems to fix the first issue. Getting correct values in the damage column (third column). But still don't know how to apply this to return which is the best attack.
You could use FILTER.
Damage:
=MAX(FILTER(Sheet1!C:C,Sheet1!A:A=A2))
Then Best Attack:
=JOIN(",",FILTER(Sheet1!B:B,Sheet1!A:A=A2,Sheet1!C:C=C2))
The JOIN will combine two or more attack names if several attacks tie for the same maximum damage.
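Outside of Sheets, the logic of this answer is a group-by max. A minimal Python sketch with the sample data (this variant keeps the first attack on a tie, unlike the JOIN approach, which keeps all tied names):

```python
attacks = [
    ("Iron Man", "Melee", 50),
    ("Iron Man", "Missile", 2500),
    ("Iron Man", "Unibeam", 100),
    ("Superman", "Melee", 9000),
    ("Superman", "Breath", 200),
    ("Superman", "Laser", 1500),
]

# For each player, keep the attack with the highest damage
# (the FILTER + MAX pair of formulas does the same per row).
best = {}
for player, attack, damage in attacks:
    if player not in best or damage > best[player][1]:
        best[player] = (attack, damage)

print(best)
# {'Iron Man': ('Missile', 2500), 'Superman': ('Melee', 9000)}
```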
I am considering the range A2:C.
Try this formula:
=SORTN(SORT(A2:C,3,0),9^9,2,1,0)
Screenshot
We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of descriptions. It is a very interesting problem to solve, but the machine learning approach we have adopted is resulting in very low accuracy. If you can suggest a more lateral approach, it will help our project a lot.
Input description
+----+-----------------------------------------------+
| ID | description                                   |
+----+-----------------------------------------------+
|  1 | delta t17267-ss ara 17 series shower trim ss  |
|  2 | delta t14438 chrome lahara tub shower trim on |
|  3 | delta t14459 trinsic tub/shower trim          |
|  4 | delta t17497 cp cassidy tub/shower trim only  |
|  5 | delta t14497-rblhp cassidy tub & shower trim  |
|  6 | delta t17497-ss cassidy 17 series tub/shower  |
+----+-----------------------------------------------+
Description in Database
+----+------------------------------------------------------------------------------------------------+
| ID | description                                                                                    |
+----+------------------------------------------------------------------------------------------------+
|  1 | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial                     |
|  2 | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential          |
|  3 | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential         |
|  4 | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential |
|  5 | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze                            |
|  6 | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential    |
+----+------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are fundamentally very similar to one another, which is causing a huge issue.
2. There are around 2 million records in the database, but when we search within a specific manufacturer the search space is reduced to a few hundred.
3. The record in "Input description" with ID 1 is the same as the record in "Description in Database" with ID 1 (we know that from a manual check).
4. We use a random forest for training and prediction.
Current approach
1. We tokenize the descriptions.
2. We remove stopwords.
3. We add abbreviation information.
4. For each record pair we calculate scores from different string metrics (Jaccard, Sørensen–Dice, cosine) and take their average.
5. We then calculate a score for the manufacturer ID using the Jaro–Winkler metric.
6. If there are 5 records of a manufacturer in "Input description" and 10 records for that manufacturer in the database, there are 50 record pairs in total (10 pairs per input record), and their scores end up very close together. We keep the top 4 record pairs from each set of 10; where more than one pair has the same score, we keep all of them.
7. We arrive at the following learning data set format:
+---------+---------------------------+-----------------------+------------------------------+---------------+-----------+
| IsMatch | description average score | manufacturer ID score | Jaccard score of description | Sørensen–Dice | cosine(3) |
+---------+---------------------------+-----------------------+------------------------------+---------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                       | 4:0.21        | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                       | 4:0.16        | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                       | 4:0.15        | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                       | 4:0.16        | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                       | 4:0.14        | 5:0.14    |
+---------+---------------------------+-----------------------+------------------------------+---------------+-----------+
We train on the above dataset, but when we predict in real time using the same approach the accuracy is very low.
Please suggest any alternative approach. We planned to use TF-IDF, but initial investigation suggests it may not improve the accuracy by much either.
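For reference, the token-set metrics mentioned above (Jaccard, Sørensen–Dice) can be sketched in Python. The tokenizer here is a hypothetical simplification (lowercase, whitespace split, strip ®/™), applied to the known matched pair (input ID 1 vs database ID 1); even for a true match the scores stay low, which hints at why the candidates' scores are so close:

```python
def tokens(text):
    # hypothetical tokenizer: lowercase, split on whitespace, strip ®/™
    return {t.strip("®™") for t in text.lower().split()}

def jaccard(a, b):
    return len(a & b) / len(a | b)

def sorensen_dice(a, b):
    return 2 * len(a & b) / (len(a) + len(b))

inp = "delta t17267-ss ara 17 series shower trim ss"  # input ID 1
db = "delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial"  # database ID 1

a, b = tokens(inp), tokens(db)
# only {delta, ara, shower, trim} overlap out of 16 distinct tokens
print(jaccard(a, b), sorensen_dice(a, b))  # 0.25 0.4
```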
Here's my sample DB: http://console.neo4j.org/r/plb1ez
It contains 2 categories with the name "Digital Cameras". With this query, I group the categories by name and return significance * view_count for each category name:
MATCH (a)
WHERE a.metatype = "Category"
RETURN DISTINCT a.categoryName, SUM(a.significance * a.view_count) AS popularity
However, what I actually need is not the absolute popularity (significance * view_count) but the relative one, so I need the query to additionally return the sum of all popularities (1500, according to my math) so that I can calculate the fraction popularity/grandTotal for each category (which I call relativePopularity).
Desired result:
+----------------------+------------+------------+--------------------+
| d.categoryName       | popularity | grandTotal | relativePopularity |
+----------------------+------------+------------+--------------------+
| Digital Compacts     | 300        | 1500       | 0.2                |
| Hand-held Camcorders | 300        | 1500       | 0.2                |
| Digital SLR          | 150        | 1500       | 0.1                |
| Digital Cameras      | 750        | 1500       | 0.5                |
+----------------------+------------+------------+--------------------+
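As a sanity check on the arithmetic (not a Cypher solution), the grand total and the per-category fractions from the desired result can be reproduced in a few lines of Python:

```python
# popularity figures from the desired result; grandTotal and
# relativePopularity are derived, which is what the single query must do
popularity = {
    "Digital Compacts": 300,
    "Hand-held Camcorders": 300,
    "Digital SLR": 150,
    "Digital Cameras": 750,
}

grand_total = sum(popularity.values())  # 1500
relative = {name: p / grand_total for name, p in popularity.items()}
print(grand_total, relative["Digital Cameras"])  # 1500 0.5
```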
Currently I'm doing this calculation with two asynchronous jobs, but I need it done in one query.
Thanks for your help!