Failing to build index with large datasets; it seems to take infinite time - python-annoy

I used annoy to build an index for 2 million vectors of size 1024. Here is the code that I used.
f = 1024
t = AnnoyIndex(f, 'euclidean')
t.on_disk_build('test.ann')
'''code for adding 2 million vectors'''
t.build(25)
After adding the vectors, t.build(25) runs forever (I let it run for about six hours and it didn't finish).
However, the same code works fine with 1 million vectors.
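For context, a minimal, self-contained sketch of the flow described above; the random vectors are only a stand-in for the real data (which the question elides), and the numbers simply mirror the question (2 million items, 25 trees):
from annoy import AnnoyIndex
import random

f = 1024                          # vector dimensionality
t = AnnoyIndex(f, 'euclidean')
t.on_disk_build('test.ann')       # build the index in a file on disk instead of in RAM

n_items = 2_000_000               # stand-in for the real 2 million vectors
for i in range(n_items):
    v = [random.gauss(0, 1) for _ in range(f)]
    t.add_item(i, v)

t.build(25)                       # 25 trees; more trees improve recall but lengthen the build
With on_disk_build, the index is written to test.ann during the build, so no separate save call is needed.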

Related

Multithreading in HM reference software

Encoding UHD sequences with the HEVC HM reference software takes days on CPUs, even on powerful machines. I want to know whether it is possible to increase the number of threads (even if it decreases encoding quality) to speed up the process; I'd like at least a 4x speed-up.
Is this possible by increasing the number of tiles, since by default there is only one tile per picture, or do I need to change the source code? And where exactly?
It seems the answer to increasing encoding speed was not the number of tiles but WPP (wavefront parallel processing).
HM lets you increase the number of tiles, with the constraint that the minimum tile width is 4 CTUs (4*64 pels) and the minimum height is 1 CTU (64 pels), so you can't just choose any number.
When you activate WPP you can have up to 17 CTU rows processed at the same time, but you cannot use WPP and tiles at the same time.
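For reference, these features are toggled in the encoder configuration file; a rough sketch of the relevant switches (parameter names can differ slightly between HM versions, so check your encoder .cfg):
WaveFrontSynchro      : 1      # enable WPP (one entropy-coding substream per CTU row)
NumTileColumnsMinus1  : 0      # tiles: leave at 0 (a single tile) when WPP is enabled
NumTileRowsMinus1     : 0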
Testing this with the BasketballDrive HD sequence at QP=37:

          T (sec)       Rate (kbps)    PSNR (dB)
1 tile    171013.381    1761.7472      34.5743
4 tiles   166401.603    1822.1880      34.5439   (saves about 3 hours)
WPP       166187.201    1785.4048      34.5483   (about the same)
It could save more with a UHD sequence, but that's not enough for me; 3 hours is nothing for JEM, and WPP has been removed from the new VTM (FVC).

What is the meaning of OneMinuteRate in JMX?

I am trying to calculate reads/second and writes/second in my Cassandra 2.1 cluster. After searching and reading, I came across the JMX bean
org.apache.cassandra.metrics:type=ClientRequest,scope=Write,name=Latency
Here I can see oneMinuteRate. I started a brand new cluster and began collecting these metrics from 0.
When I wrote my first record, I saw
Count = 1
OneMinuteRate = 0.01599111...
Does it mean that my writes/s is 0.0159911?
Or does it mean that, based on one minute of data, my write latency is 0.01599, where write latency refers to the response time for writing a record?
Please help me understand the value.
Thanks.
It means that over the last minute, your writes were occurring at a rate of 0.01599 writes per second. Think about it this way: the rate of writes in the last 60 seconds would be
WritesInLastMinute ÷ 60
So in your case
1 ÷ 60 = 0.0166
Or more precisely, .01599.
If you observed no further writes after that, the value would descend down to zero over the next minute.
OneMinuteRate, FiveMinuteRate, and FifteenMinuteRate are exponential moving averages. They are not simply counts divided by elapsed time; instead, as the name implies, they apply an exponentially weighted series of averages, as below:
result(t) = (1 - w) * result(t - 1) + (w) * event_this_period
where w is the weighting factor and t is the tick time. In other words, they take roughly 20% of the new reading and 80% of the old readings; it's the same way UNIX systems measure CPU load.
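As a minimal sketch of that recurrence (using the simplified w = 0.2 from this answer, not the exact weight a real Dropwizard meter derives from its tick interval and window):
def ewma_rates(events_per_period, w=0.2):
    """Yield the smoothed rate after each period, given raw per-period event counts."""
    rate = 0.0
    for events in events_per_period:
        rate = (1 - w) * rate + w * events   # result(t) = (1 - w) * result(t - 1) + w * event_this_period
        yield rate

# One write in the first period, then silence: the rate jumps up and then decays toward zero.
for r in ewma_rates([1, 0, 0, 0, 0]):
    print(round(r, 4))                       # prints 0.2, 0.16, 0.128, 0.1024, 0.0819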
However, when this applies to requests that a server receives: below is a chart from a single request to a server, with measurements taken by Dropwizard.
As you can see, a single request produces a curve that evolves over time. This is really useful for spotting trends, but I'm not sure these rates are great for monitoring live traffic, especially critical traffic.

Automatically increase number of random cases selected with SPSS syntax and macros

I'm trying to force SPSS to do a pseudo-Monte Carlo study. The real-world data is so bizarre that I can't reliably simulate it (if you're interested, it is for testing Injury Severity Scores). As such, I'm using a dataset of about 0.5 million observations of the real-world data and basically bootstrapping the results from increasingly large random samples of it. The goal is to figure out what group sizes are necessary to assume normality (at what group sizes do t-tests and Mann-Whitney U tests reliably agree; in other words, when can I count on the Central Limit Theorem).
My plan is to use a combination of a macro that repeats the two tests 100 times (actually run 150 times, in case the random selection results in a group size of zero) and OMS commands to export the results of the numerous tests into a separate data file.
So far everything works just fine, but I would like to add another looping command to run the process again while selecting more random cases. It would run 150 times with 10 random cases selected each time; then, after the first 150, it would run another 150 but select 20 random cases. Optimally, it would be something like this:
Select 10 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
Select 20 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
...
(After running on 200 cases, now increase by 50)
Select 250 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
Select 300 random cases
...
Select 800 random cases
Run a t-test and a Mann-Whitney U test
Repeat 150 times
(Stop after running on 800 cases)
Save all of these results using OMS
Everything in the syntax below works perfectly except for one small issue: I can't figure out how to have it increase the size of the random sample, and I would prefer not to do that manually.
Even if I have to do it manually, is there a way to append the latest results to the existing file instead of replacing it?
DEFINE !repeater().
!DO !i=1 !TO 150.
*repeat the below processes 150 times
*select a random sample from the dataset
DATASET ACTIVATE DataSet1.
USE ALL.
do if $casenum=1.
compute #s_$_1=10.
compute #s_$_2=565518.
* 565518 is the total number of cases
end if.
do if #s_$_2 > 0.
compute filter_$=uniform(1)* #s_$_2 < #s_$_1.
compute #s_$_1=#s_$_1 - filter_$.
compute #s_$_2=#s_$_2 - 1.
else.
compute filter_$=0.
end if.
VARIABLE LABELS filter_$ 'x random cases (SAMPLE)'.
FORMATS filter_$ (f1.0).
FILTER BY filter_$.
EXECUTE.
*run a non-parametric test
NPAR TESTS
/M-W= issloc BY TwoGroups(0 1)
/MISSING ANALYSIS.
*run a parametric test
T-TEST GROUPS=TwoGroups(0 1)
/MISSING=ANALYSIS
/VARIABLES=issloc
/CRITERIA=CI(.95).
!DOEND.
!ENDDEFINE.
*use OMS to extract the reported descriptives and results from the viewer
*and save them to a file
OMS /SELECT TABLES
/DESTINATION FORMAT = SAV OUTFILE = 'folder/folder/OMS file.sav'
/IF SUBTYPES=['Mann Whitney Ranks' 'Mann Whitney Test Statistics' 'Group Statistics' 'Independent Samples Test']
/COLUMNS SEQUENCE = [RALL CALL LALL].
!repeater.
OMSEND.
Never mind. The answer was so obvious, I missed it entirely. I just needed to define the sample size selection within the macro. *facepalm
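For anyone hitting the same wall, a rough sketch of that fix; the argument name !size is just an illustration, and the macro body is the same sampling and test syntax posted above with the hard-coded 10 swapped out:
DEFINE !repeater (size = !TOKENS(1)).
!DO !i=1 !TO 150.
* ...same DATASET ACTIVATE / sampling / NPAR TESTS / T-TEST block as above, except: .
compute #s_$_1 = !size.
!DOEND.
!ENDDEFINE.
!repeater size = 10.
!repeater size = 20.
!repeater size = 30.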

How can I generate unique random numbers in quick succession?

I have a function that is called three times in quick succession, and it needs to generate a pseudorandom integer between 1 and 6 on each pass. However I can't manage to get enough entropy out of the function.
I've tried seeding math.randomseed() with all of the following, but there's never enough variation to affect the outcome.
os.time()
tonumber(tostring(os.time()):reverse():sub(1,6))
socket.gettime() * 1000
I've also tried this snippet, but every time my application runs, it generates the same pattern of numbers in the same order. I need different(ish) numbers every time my application runs.
Any suggestions?
Bah, I needed another zero when multiplying socket.gettime(). Multiplied by 10000, there is enough distance between successive values to give me a good enough seed.
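For anyone else landing here, a minimal sketch of that fix, assuming LuaSocket is available as socket (the floor call is just a precaution for Lua versions that expect an integer seed):
local socket = require("socket")

math.randomseed(math.floor(socket.gettime() * 10000))  -- sub-millisecond-resolution seed
local roll = math.random(1, 6)                         -- pseudorandom integer between 1 and 6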

Finding the largest prime factor of 600851475143 [duplicate]

This question already has answers here:
Project Euler #3 in Ruby solution times out
(2 answers)
Closed 9 years ago.
I'm trying to use a program to find the largest prime factor of 600851475143. This is for Project Euler here: http://projecteuler.net/problem=3
I first attempted this with this code:
#Ruby solution for http://projecteuler.net/problem=3
#Prepared by Richard Wilson (Senjai)
#We'll keep to our functional style of approaching these problems.
def gen_prime_factors(num) # generate the prime factors of num and return them in an array
  result = []
  2.upto(num-1) do |i| # ASSUMPTION: num > 3
    # test if num is evenly divisible by i; if so, add it to the result
    result.push i if num % i == 0
    puts "Prime factor found: #{i}" # some status updates so we know there wasn't a crash
  end
  result # implicit return
end
# Print the largest prime factor of 600851475143. It will always be the last value in the array, so:
puts gen_prime_factors(600851475143).last # this might take a while
This is great for small numbers, but for large numbers it would take a VERY long time (and a lot of memory).
Now I took university calculus a while ago, but I'm pretty rusty and haven't kept up on my math since.
I don't want a straight up answer, but I'd like to be pointed toward resources or told what I need to learn to implement some of the algorithms I've seen around in my program.
There are a couple of problems with your solution. First of all, you never test that i is prime, so you're only finding the largest factor of the big number, not the largest prime factor. There's a Ruby library you can use: just require 'prime', and you can add && i.prime? to your condition.
That'll fix the inaccuracy in your program, but it'll still be slow and expensive (in fact, it'll now be even more expensive). One obvious thing you can do is set result = i rather than result.push i: since you ultimately only care about the last viable i you find, there's no reason to maintain a list of all the prime factors.
Even then, however, it's still very slow. The correct program should complete almost instantly. The key is to shrink the number you're testing up to, each time you find a prime factor. If you've found a prime factor p of your big number, then you don't need to test all the way up to the big number anymore. Your "new" big number that you want to test up to is what's left after dividing p out from the big number as many times as possible:
big_number = big_number/p**n
where n is the largest integer such that the right hand side is still a whole number. In practice, you don't need to explicitly find this n, just keep dividing by p until you stop getting a whole number.
Finally, as a SPOILER I'm including a solution below, but you can choose to ignore it if you still want to figure it out yourself.
require 'prime'

max = 600851475143
test = 3
while (max >= test) do
  if (test.prime? && (max % test == 0))
    best = test        # remember the factor we just found
    max = max / test   # shrink the number before continuing
  else
    test = test + 2    # only try odd candidates
  end
end
puts "Here's your number: #{best}"
Exercise: Prove that test.prime? can be eliminated from the if condition. [Hint: what can you say about the smallest (non-1) divisor of any number?]
Exercise: This algorithm is slow if we instead use max = 600851475145. How can it be improved to be fast for either value of max? [Hint: Find the prime factorization of 600851475145 by hand; it's easy to do and it'll make it clear why the current algorithm is slow for this number]

Resources