I'm trying to sum up across a row for different numerical variables that have been processed through the Compare Means function.
Below is what I have generated from Compare Means, apart from the last 'Total' column, which is the one I'm looking to generate.
+--------+-------+-------+-------+-------+
| | Var 1 | Var 2 | Var 3 | Total |
+--------+-------+-------+-------+-------+
| Mean | 10 | 1 | 2 | |
| Median | 4 | 20 | 4 | |
| Range | 6 | 40 | 1 | |
| Std.dev| 3 | 3 | 3 | |
+--------+-------+-------+-------+-------+
Here's the syntax of my command:
MEANS TABLES=VAR_1 VAR_2 VAR_3
/CELLS=MEAN STDDEV MEDIAN RANGE.
Can't really imagine what the use is for summing these values, but forget about why - this is how:
The OMS command takes results from the output and puts them in a new dataset which you can then further analyse, as you requested.
DATASET DECLARE MyResults.
OMS /SELECT TABLES /IF COMMANDS=['Means'] SUBTYPES=['Report'] /DESTINATION FORMAT=SAV OUTFILE='MyResults' .
* now your original code.
MEANS TABLES=VAR_1 VAR_2 VAR_3 /CELLS=MEAN STDDEV MEDIAN RANGE.
* now your results are captured - we'll go see them.
omsend.
dataset activate MyResults.
* the results are now in a new dataset, which you can analyse.
compute total=sum(VAR_1, VAR_2, VAR_3).
exe.
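For comparison, here is a minimal Python sketch of the same row-wise sum, using the numbers from the table in the question. The `stats` dictionary is only an illustrative stand-in for the captured results, not something SPSS produces in this form.

```python
# Statistics per variable, copied from the table in the question
# (columns are Var 1, Var 2, Var 3).
stats = {
    "Mean": [10, 1, 2],
    "Median": [4, 20, 4],
    "Range": [6, 40, 1],
    "Std.dev": [3, 3, 3],
}

# Row-wise total, the equivalent of compute total=sum(VAR_1, VAR_2, VAR_3).
totals = {name: sum(values) for name, values in stats.items()}

for name, total in totals.items():
    print(name, total)
```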
I'm trying to count the number of items that fit at least one criterion. But my current formula counts 2 instead of 1 when an item fits 2 criteria at the same time.
Considering the following example :
Article | Rate 1 | Rate 2 | Rate 3 | Language
1 | 12% | 54% | 6% | English
2 | 65% | 55% | 34% | English
3 | 59% | 12% | 78% | French
4 | 78% | 8% | 47% | English
5 | 12% | 11% | 35% | English
How do you count the number of articles in English with at least one success rate over 50%?
Right now my formula counts 4 instead of 3, because the article 2 counts for 2. (I'm on google sheets)
Thank you for your help.
Best,
Assuming the data is in columns A:E, you could use either of the following. Each row is counted once, no matter how many of its rates are over 50%:
=COUNT(filter(A2:A6,E2:E6="English",(D2:D6>0.5)+(C2:C6>0.5)+(B2:B6>0.5)))
=SUMPRODUCT(--(E2:E6="english"), SIGN((B2:B6>0.5)+(C2:C6>0.5)+(D2:D6>0.5)))
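The logic behind both formulas can be sketched in Python, using the sample data from the question. The function name `count_matching` is just an illustrative choice. The point is that `any(...)` collapses several matching rates in one row into a single hit, which is what SIGN does in the SUMPRODUCT version.

```python
# Sample data from the question: (article, rate 1, rate 2, rate 3, language).
articles = [
    (1, 0.12, 0.54, 0.06, "English"),
    (2, 0.65, 0.55, 0.34, "English"),
    (3, 0.59, 0.12, 0.78, "French"),
    (4, 0.78, 0.08, 0.47, "English"),
    (5, 0.12, 0.11, 0.35, "English"),
]


def count_matching(rows, language, threshold):
    """Count rows in the given language with at least one rate over threshold."""
    return sum(
        1
        for _article, *rates, lang in rows
        if lang == language and any(rate > threshold for rate in rates)
    )


# Counting every matching cell instead of every matching row double-counts article 2.
cell_count = sum(
    sum(rate > 0.5 for rate in rates)
    for _article, *rates, lang in articles
    if lang == "English"
)

print(count_matching(articles, "English", 0.5))  # rows with at least one match
print(cell_count)                                # matching cells
```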
I'm running a test:
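import time
from random import randint

from distributed import Client

client = Client('127.0.0.1:8786')

def x(i):
    return {}

while True:
    start = time.time()
    # submit takes the function and its arguments separately
    a = client.submit(x, randint(0, 1000000))
    res = a.result()
    del a
    end = time.time()
    print("Ran on %s with res %s" % (end-start, res))

# not reached while the loop above runs forever
client.shutdown()
del client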
I used it (with more code) to get an estimate of my queries' performance. But for this example I've removed everything I could think of.
The above code leaks roughly 0.1 MB per second, which I would guesstimate at roughly 0.3 MB per 1000 calls.
Am I doing something wrong in my code?
My Python debugging skills are a bit rusty (and by "a bit" I mean I last used objgraph on Orbited, the precursor to websockets, in 2009: https://pypi.python.org/pypi/orbited), but here is what I can see from checking the number of references before and after.
Counting objects in the scheduler with objgraph.show_most_common_types(), before and after:
| What        | Before | After | Diff  |
|-------------+--------+-------+-------|
| function    |  33318 | 33399 |    81 |
| dict        |  17988 | 18277 |   289 |
| tuple       |  16439 | 28062 | 11623 |
| list        |  10926 | 11257 |   331 |
| OrderedDict |    N/A |  7168 |  7168 |
It's not a huge amount of RAM in any case, but digging deeper I found that scheduler._transition_counter is 11453 and scheduler.transition_log is filled with:
('x-25ca747a80f8057c081bf1bca6ddd481', 'released', 'waiting',
OrderedDict([('x-25ca747a80f8057c081bf1bca6ddd481', 'processing')]), 4121),
('x-25ca747a80f8057c081bf1bca6ddd481', 'waiting', 'processing', {}, 4122),
('x-25cb592650bd793a4123f2df39a54e29', 'memory', 'released', OrderedDict(), 4123),
('x-25cb592650bd793a4123f2df39a54e29', 'released', 'forgotten', {}, 4124),
('x-25ca747a80f8057c081bf1bca6ddd481', 'processing', 'memory', OrderedDict(), 4125),
('x-b6621de1a823857d2f206fbe8afbeb46', 'released', 'waiting', OrderedDict([('x-b6621de1a823857d2f206fbe8afbeb46', 'processing')]), 4126)
First error on my part
Which of course led me to realise the first error on my part was not configuring transition-log-length.
After setting configuration transition-log-length to 10:
| What           | Before | After | Diff |
|----------------+--------+-------+------|
| function       |  33323 | 33336 |   13 |
| dict           |  17987 | 18120 |  133 |
| tuple          |  16530 | 16342 | -188 |
| list           |  10928 | 11136 |  208 |
| _lru_list_elem |    N/A |  5609 | 5609 |
A quick google found that _lru_list_elem is made by functools.lru_cache, which in turn is invoked in key_split (in distributed/utils.py).
That is an LRU cache of up to 100 000 items.
Second try
Based on the code, it appears that Dask should climb up to roughly 100k _lru_list_elem.
After running my script again and watching the memory, I see it climb quite fast until it approaches 100k _lru_list_elem; afterwards it stops climbing almost entirely.
This appears to be the case, since it pretty much flat-lines after 100k.
So no leak, but fun to get hands dirty on Dask source code and Python memory profilers
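The plateau can be reproduced in isolation with functools.lru_cache. This is only a sketch: `key_prefix` is a hypothetical stand-in rather than Dask's key_split, and the bound of 1000 is an arbitrary small value chosen so the effect shows up quickly.

```python
from functools import lru_cache


@lru_cache(maxsize=1000)  # assumed small bound, for illustration only
def key_prefix(key):
    """Hypothetical cached helper: return the part of a task key before the first dash."""
    return key.split("-")[0]


sizes = []
for i in range(3000):
    key_prefix("x-%d" % i)  # every key is new, so every call adds a cache entry
    if (i + 1) % 1000 == 0:
        sizes.append(key_prefix.cache_info().currsize)

# The cache grows until it reaches maxsize and then stays there,
# because the least recently used entries get evicted.
print(sizes)
```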
For diagnostic, logging, and performance reasons the Dask scheduler keeps records on many of its interactions with workers and clients in fixed-sized deques. These records do accumulate, but only to a finite extent.
We also try to ensure that we don't keep around anything that would be too large.
Seeing memory use climb up until a nice round number like what you've seen and then stay steady seems to be consistent with this.
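As a rough illustration of the fixed-size deque behaviour described above, here is a standalone sketch. The length of 10 and the tuple layout are assumptions made for the example and are not taken from the scheduler's code.

```python
from collections import deque

# A bounded log: once maxlen is reached, appending drops the oldest record.
log = deque(maxlen=10)

for i in range(25):
    log.append(("x-%d" % i, "released", "waiting", i))

print(len(log))    # stays at the bound, however many records were appended
print(log[0][3])   # counter value of the oldest record still kept
print(log[-1][3])  # counter value of the newest record
```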
We are working on a record linkage project.
In simple terms, we are searching for a product in a database just by looking at the similarity of its description. It is a very interesting problem to solve, but the machine learning approach we have adopted currently results in very low accuracy. If you can suggest a different, more lateral approach, it would help our project a lot.
Input description
+----+-----------------------------------------------+
| ID | description                                   |
+----+-----------------------------------------------+
| 1  | delta t17267-ss ara 17 series shower trim ss  |
| 2  | delta t14438 chrome lahara tub shower trim on |
| 3  | delta t14459 trinsic tub/shower trim          |
| 4  | delta t17497 cp cassidy tub/shower trim only  |
| 5  | delta t14497-rblhp cassidy tub & shower trim  |
| 6  | delta t17497-ss cassidy 17 series tub/shower  |
+----+-----------------------------------------------+
Description in Database
+----+------------------------------------------------------------------------------------------------+
| ID | description                                                                                    |
+----+------------------------------------------------------------------------------------------------+
| 1  | delta monitor17 ara® shower trim 2 gpm 1 lever handle stainless commercial                     |
| 2  | delta monitor 14 lahara® tub and shower trim 2 gpm 1 handle chrome plated residential          |
| 3  | delta monitor 14 trinsic® tub and shower trim 2 gpm 1 handle chrome plated residential         |
| 4  | delta monitor17 addison™ tub and shower trim 2 gpm 1 handle chrome plated domestic residential |
| 5  | delta monitor 14 cassidy™ tub and shower trim 2 gpm venetian bronze                            |
| 6  | delta monitor 17 addison™ tub and shower trim 2 gpm 1 handle stainless domestic residential    |
+----+------------------------------------------------------------------------------------------------+
Background information
1. The records in the database are very similar to one another, which is what causes most of the difficulty.
2. There are around 2 million records in the database, but when we search for a specific manufacturer the search space is reduced to a few hundred.
3. The record with ID 1 in "Input description" is the same product as the record with ID 1 in "Description in Database" (we know that from manual matching).
4. We use a random forest to train and predict.
Current approach
1. We tokenize the description.
2. We remove stopwords.
3. We add abbreviation information.
4. For each record pair we calculate scores from several string metrics (Jaccard, Sørensen-Dice, cosine) and then take the average of these scores.
5. We then calculate a score for the manufacturer ID using the Jaro-Winkler metric.
6. If there are 5 records for a manufacturer in "Input description" and 10 records for that manufacturer in the database, there are 50 record pairs in total, that is 10 pairs per input record, and their scores are very close together. We keep the top 4 record pairs from each set of 10 pairs. Where more than one record pair has the same score, we keep all of them.
7. We arrive at the following learning data set format.
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| ISMatch | Description average score | manufacturer ID score | jaccard score of description | sorensenDice | cosine(3) |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
| 1       | 1:0.19                    | 2:0.88                | 3:0.12                       | 4:0.21       | 5:0.23    |
| 0       | 1:0.14                    | 2:0.66                | 3:0.08                       | 4:0.16       | 5:0.17    |
| 0       | 1:0.14                    | 2:0.68                | 3:0.08                       | 4:0.15       | 5:0.19    |
| 0       | 1:0.14                    | 2:0.58                | 3:0.08                       | 4:0.16       | 5:0.16    |
| 0       | 1:0.12                    | 2:0.55                | 3:0.08                       | 4:0.14       | 5:0.14    |
+---------+---------------------------+-----------------------+------------------------------+--------------+-----------+
We train on the above dataset. When we predict in real time using the same approach, the accuracy is very low.
Please suggest any alternative approach.
We planned to use TF-IDF, but initial investigation suggests it may not improve the accuracy by much either.
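For reference, the per-pair description scores listed under "Current approach" can be sketched as follows. This is a simplified version that assumes plain whitespace tokenization and set-based metrics. It leaves out the stopword and abbreviation handling and will not reproduce the exact scores in the table above. All function names are illustrative.

```python
import math


def tokens(text):
    """Lower-case a description and split it into a set of whitespace tokens."""
    return set(text.lower().split())


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def sorensen_dice(a, b):
    """Twice the shared tokens divided by the sum of both set sizes."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Cosine similarity of the two sets viewed as binary token vectors."""
    return len(a & b) / math.sqrt(len(a) * len(b)) if a and b else 0.0


def pair_scores(left, right):
    """Return the three metric scores and their average for one record pair."""
    a, b = tokens(left), tokens(right)
    scores = {
        "jaccard": jaccard(a, b),
        "sorensen_dice": sorensen_dice(a, b),
        "cosine": cosine(a, b),
    }
    scores["average"] = sum(scores.values()) / 3  # computed before this key is added
    return scores


print(pair_scores("tub shower trim", "shower trim kit"))
```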
here's my sample DB: http://console.neo4j.org/r/plb1ez
It contains 2 categories with the name "Digital Cameras". With this query, I group them by name and return the sum of significance*view_count for each category name:
MATCH (a)
WHERE a.metatype = "Category"
RETURN DISTINCT a.categoryName, SUM(a.significance * a.view_count) AS popularity
However, what I actually need is not the absolute popularity (=significance*view_count), but the relative one - so I need my query to additionally return the sum of all popularities (should be 1500 according to my math), so I can calculate the fraction (which I call "relativePopularity") for each category (popularity/grandTotal).
Desired result:
+----------------------+------------+------------+--------------------+
| d.categoryName       | popularity | grandTotal | relativePopularity |
+----------------------+------------+------------+--------------------+
| Digital Compacts     | 300        | 1500       | 0.2                |
| Hand-held Camcorders | 300        | 1500       | 0.2                |
| Digital SLR          | 150        | 1500       | 0.1                |
| Digital Cameras      | 750        | 1500       | 0.5                |
+----------------------+------------+------------+--------------------+
Currently I'm doing this calculation with two asynchronous jobs, but I need it done in one go.
Thanks for your help!
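To make the target arithmetic concrete, here is a small Python sketch that starts from the per-category popularity values in the desired result and derives grandTotal and relativePopularity. It only illustrates the calculation and is not a Cypher solution.

```python
# Per-category popularity, copied from the desired result table.
popularity = {
    "Digital Compacts": 300,
    "Hand-held Camcorders": 300,
    "Digital SLR": 150,
    "Digital Cameras": 750,
}

# Grand total over all categories, then each category's share of it.
grand_total = sum(popularity.values())
relative_popularity = {
    name: value / grand_total for name, value in popularity.items()
}

for name, value in popularity.items():
    print(name, value, grand_total, relative_popularity[name])
```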