I need a way to log and check the number of accesses per IP. If an IP makes more than ten requests per minute, its requests should be denied.
This Lua script was created to check that condition. If the IP does not exist, the script creates a new counter for it with a 60-second TTL. If it does exist, the script increments the counter and checks whether it is greater than ten for that IP.
-- KEYS[1] == "163.2.2.2" (the client IP)
if redis.call("EXISTS", KEYS[1]) == 1 then
  local occurrences = redis.call("INCR", KEYS[1])
  if occurrences > 10 then
    return true
  else
    return false
  end
else
  redis.call("SETEX", KEYS[1], 60, 1)
  return false
end
It works fine, but a Lua script (like a transaction) blocks Redis, so I can't validate this with optimistic locking. What would be the best way to do this in Redis without blocking the server and without race conditions between reads and writes?
In Redis, (almost) all commands block the server, Lua script evaluation included. That said, keep in mind that your server can still cater to a lot of requests while ensuring their isolation.
Lua scripts do not require optimistic locking, and as long as they are simple enough - like yours - they are a good choice. IMO this script will do as intended for basic rate limiting.
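To make the usage concrete, here is a minimal sketch of calling a script like this from Ruby with the redis-rb gem. The rate_limited? wrapper and the "ratelimit:" key prefix are my own naming, and the script returns 1/0 instead of true/false because a Lua false comes back to the client as a nil reply:

require 'redis'

RATE_LIMIT_SCRIPT = <<~LUA
  if redis.call("EXISTS", KEYS[1]) == 1 then
    local occurrences = redis.call("INCR", KEYS[1])
    if occurrences > 10 then
      return 1
    end
    return 0
  else
    redis.call("SETEX", KEYS[1], 60, 1)
    return 0
  end
LUA

redis = Redis.new

# Returns true once the IP has exceeded 10 requests in the current 60-second window.
def rate_limited?(redis, ip)
  redis.eval(RATE_LIMIT_SCRIPT, keys: ["ratelimit:#{ip}"]) == 1
end

rate_limited?(redis, "163.2.2.2")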
EDIT:
My question was horribly put, so I deleted it and am rephrasing it entirely here.
I'll give a tl;dr:
I'm trying to assign each computation to a designated worker that fits the computation type.
In long:
I'm trying to run a simulation, so I represent it using a class of the form:
class Simulation:
    def __init__(self, first_client: Client, second_client: Client):
        self.first_client = first_client
        self.second_client = second_client

    def first_calculation(self, input):
        with self.first_client.as_current():
            # ... compute output on the first cluster ...
            return output

    def second_calculation(self, input):
        with self.second_client.as_current():
            # ... compute output on the second cluster ...
            return output

    def run(self, input):
        return self.second_calculation(self.first_calculation(input))
This format has downsides, like the fact that the simulation object is not pickleable.
I could edit the Simulation object to contain only addresses and not clients, for example, but I feel as if there must be a better solution. For instance, I would like the Simulation object to work the following way:
class Simulation:
    def first_calculation(self, input):
        client = dask.distributed.get_client()
        with client.as_current():
            # ... compute output ...
            return output
...
The thing is, the Dask workers best suited for the first calculation are different from the Dask workers best suited for the second calculation, which is why my Simulation object has two clients connecting to two different schedulers to begin with. Is there any way to have a single client but two types of schedulers, so that the client knows to send first_calculation to the first scheduler and second_calculation to the second one?
Dask will chop up large computations into smaller tasks that can run in parallel. Those tasks will then be submitted by the client to the scheduler, which in turn will schedule them on the available workers.
Sending the client object to a Dask scheduler will likely not work due to the serialization issue you mention.
You could try one of two approaches:
Depending on how you actually run those worker machines, you could specify different types of workers for different tasks. If you run on Kubernetes, for example, you could try to leverage the node pool functionality to make different worker types available.
An easier approach using your existing infrastructure would be to return the results of your first computation back to the machine from which you are using the client, using something like .compute(), and then use that data as input for the second computation. In this case you're sending the actual data over the network instead of the client. If the size of that data becomes an issue, you can always write the intermediate results to something like S3.
Dask does support giving specific tasks to specific workers with annotate. Here's an example snippet, where a delayed_sum task was passed to one worker and the doubled task was sent to the other worker. The assert statements check that those workers really were restricted to only those tasks. With annotate you shouldn't need separate clusters. You'll also need the most recent versions of Dask and Distributed for this to work because of a recent bug fix.
import distributed
import dask
from dask import delayed

local_cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(local_cluster)
workers = list(client.scheduler_info()['workers'].keys())

with dask.annotate(workers=workers[0]):
    delayed_sum = delayed(sum)([1, 2])

with dask.annotate(workers=workers[1]):
    doubled = delayed_sum * 2

# use persist so the scheduler doesn't clean up;
# wrap in distributed.wait to make sure the tasks are there when we check the scheduler
distributed.wait([doubled.persist(), delayed_sum.persist()])

worker_restrictions = local_cluster.scheduler.worker_restrictions
assert worker_restrictions[delayed_sum.key] == {workers[0]}
assert worker_restrictions[doubled.key] == {workers[1]}
I am on what I now consider part 3 of a task: pinging a very large list of URLs (numbering in the thousands) and retrieving each URL's associated x509 certificate. Part 1 is here (How do I properly use threads to ping a URL) and Part 2 is here (Why won't my connection pool implement my thread code).
Since I asked these two questions, I have now ended up with the following code:
###### This is the code that pings a url and grabs its x509 cert #####
require 'socket'
require 'openssl'

class SslClient
  attr_reader :url, :port, :timeout

  def initialize(url, port = '443')
    @url = url
    @port = port
  end

  def ping_for_certificate_info
    context = OpenSSL::SSL::SSLContext.new
    tcp_client = TCPSocket.new(url, port)
    ssl_client = OpenSSL::SSL::SSLSocket.new(tcp_client, context)
    ssl_client.hostname = url
    ssl_client.sync_close = true
    ssl_client.connect
    certificate = ssl_client.peer_cert
    verify_result = ssl_client.verify_result
    tcp_client.close
    { certificate: certificate, verify_result: verify_result }
  rescue => error
    { certificate: nil, verify_result: nil }
  end
end
It is paramount that the above code retrieves ssl_client.peer_cert. Below is the snippet that makes multiple HTTP pings to URLs for their certs:
pool = Concurrent::CachedThreadPool.new
pool.post do
  [LARGE LIST OF URLS TO PING].each do |struct|
    ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
    cert_info = ssl_client.ping_for_certificate_info
    struct.x509_cert = cert_info[:certificate]
    struct.verify_result = cert_info[:verify_result]
  end
end
pool.shutdown
pool.wait_for_termination
#Do some rails code with the database depending on the results.
So far, when I run this code, it is unbelievably slow. I thought that by creating a thread pool with threads, the code would go much faster. That doesn't seem to be the case, and I'm not sure why. A lot of that was because I didn't know the nuances of threads, pools, starvation, locks, etc. However, after implementing the above code I read some more to try to speed it up, and once again I'm confused and could use some clarification as to how I can make the code faster.
For starters, this excellent article (ruby-concurrency-parallelism) gives the following definitions and concepts:
Concurrency vs. Parallelism
These terms are used loosely, but they do have distinct meanings.
Concurrency: The art of doing many tasks, one at a time. By switching between them quickly, it may appear to the user as though they happen simultaneously.
Parallelism: Doing many tasks at literally the same time. Instead of appearing simultaneous, they are simultaneous.
Concurrency is most often used for applications that are IO heavy. For
example, a web app may regularly interact with a database or make lots
of network requests. By using concurrency, we can keep our application
responsive, even while we wait for the database to respond to our
query.
This is possible because the Ruby VM allows other threads to run while
one is waiting during IO. Even if a program has to make dozens of
requests, if we use concurrency, the requests will be made at
virtually the same time.
Parallelism, on the other hand, is not currently supported by Ruby.
So from this piece of the article, I understand that what I want to do needs to be done concurrently, because I am pinging URLs over the network, and that parallelism is not currently supported by Ruby.
Next is where things get confusing for me. From my part 1 question on Stack Overflow, I learned from a comment that I should do the following:
Use a thread pool; don't just create a thousand concurrent threads. For something like
connecting to a URL where there will be a lot of waiting you can
oversubscribe the number of threads per CPU core, but not by a huge
amount. You'll have to experiment.
Another user says this:
You'd not spawn thousands of threads, use a connection pool
(e.g https://github.com/mperham/connection_pool) so you have maximum
20-30 concurrent requests going (this maximum number should be
determined by testing at which point network performance drops and you
get these timeouts)
So for this part, I turned to concurrent-ruby and implemented both a CachedThreadPool and a FixedThreadPool with 10 threads. I chose a CachedThreadPool because it seemed to me that the number of threads needed would be taken care of for me by the thread pool. Now, in concurrent-ruby's documentation for a pool, I see this:
pool = Concurrent::CachedThreadPool.new
pool.post do
  # some parallel work
end
I thought we just established in the first article that parallelism is not supported in Ruby, so what is the thread pool doing? Is it working concurrently or in parallel? What exactly is going on? Do I need a thread pool or not? Also, at this point I thought connection pools and thread pools were the same thing, with the names used interchangeably. What is the difference between the two, and which one do I need?
In another excellent article, How to Perform Concurrent HTTP Requests in Ruby and Rails, the Concurrent::Promise class from concurrent-ruby is introduced to avoid locks and have thread safety across two API calls. Here is a snippet of code with the accompanying description:
def get_all_conversations
  groups_thread = Thread.new do
    get_groups_list
  end
  channels_thread = Thread.new do
    get_channels_list
  end
  [groups_thread, channels_thread].map(&:value).flatten
end
Every request is executed in its own thread, which can run in parallel because it is blocking I/O. But can you see a catch here?
In the above code is another mention of parallelism, which we just said doesn't exist in Ruby. Below is the approach with Concurrent::Promise:
def get_all_conversations
  groups_promise = Concurrent::Promise.execute do
    get_groups_list
  end
  channels_promise = Concurrent::Promise.execute do
    get_channels_list
  end
  [groups_promise, channels_promise].map(&:value!).flatten
end
So according to this article, these requests are being made 'in parallel'. Are we still talking about concurrency at this point?
Finally, in these two articles, they talk about using Futures for concurrent http requests. I won't go into the details but I'll paste the links here.
1. Using Concurrent Ruby in a Ruby on Rails Application
2. Learn Concurrency by Implementing Futures in Ruby
Again, what's talked about in these articles looks to me like the Concurrent::Promise functionality. I just want to note that the examples show how to use the concepts for two different API calls that need to be combined together. This is not what I need. I just need to make thousands of API calls quickly and log the results.
In conclusion, I just want to know what I need to do to make my code faster and thread safe so that it runs concurrently. What exactly am I missing? Right now the code is going so slow that I might as well not have used threads in the first place.
Summary
I have to ping thousands of URLs using threads to speed up the process. The code is slow and I am confused if I am using threads, thread pools, and concurrency correctly.
Let us look at the problems you have described and try to solve them one at a time:
You have two pieces of code, SslClient and the script which uses this SSL client. From my understanding of thread pools, the way you have used the pool needs to be changed a bit.
From:
pool = Concurrent::CachedThreadPool.new
pool.post do
  [LARGE LIST OF URLS TO PING].each do |struct|
    ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
    cert_info = ssl_client.ping_for_certificate_info
    struct.x509_cert = cert_info[:certificate]
    struct.verify_result = cert_info[:verify_result]
  end
end
pool.shutdown
pool.wait_for_termination
to:
pool = Concurrent::FixedThreadPool.new(10)
[LARGE LIST OF URLS TO PING].each do |struct|
  pool.post do
    ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
    cert_info = ssl_client.ping_for_certificate_info
    struct.x509_cert = cert_info[:certificate]
    struct.verify_result = cert_info[:verify_result]
  end
end
pool.shutdown
pool.wait_for_termination
In the initial version, there is only one unit of work that is posted to the pool. In the second version, we are posting as many units of work to the pool as there are items in LARGE LIST OF URLS TO PING.
To add a bit more about concurrency vs. parallelism in Ruby: it is true that Ruby doesn't support true parallelism because of the GIL (Global Interpreter Lock), but that only applies while we are actually doing work on the CPU. In the case of a network request, the CPU-bound work is negligible compared to the IO-bound work, which means your use case is a very good candidate for using threads.
Also, by using a thread pool we minimize the overhead of thread creation. When we use a pool like Concurrent::FixedThreadPool.new(10), we are literally restricting the number of threads available in the pool; with an unbounded thread pool, a new thread is created every time a unit of work arrives while the rest of the threads in the pool are busy.
In the first article, there was a need to collect the result returned by each individual worker and also to act meaningfully in case of an exception (I am the author). You should be able to use the class given in that blog post without any change.
Let's try rewriting your code using Concurrent::Future, since in your case too we need the results.
thread_pool = Concurrent::FixedThreadPool.new(20)
executors = [LARGE LIST OF URLS TO PING].map do |struct|
  Concurrent::Future.execute({ executor: thread_pool }) do
    ssl_client = SslClient.new(struct.domain.gsub("*.", "www."), struct.scan_port)
    cert_info = ssl_client.ping_for_certificate_info
    struct.x509_cert = cert_info[:certificate]
    struct.verify_result = cert_info[:verify_result]
    struct
  end
end
executors.map(&:value)
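One hedged note on the snippet above: Future#value returns nil for a future that raised, so if you later let exceptions propagate out of SslClient (instead of rescuing them as it does now), you can find the failures like this:

results = executors.map(&:value)          # nil for any future that raised

executors.select(&:rejected?).each do |future|
  warn future.reason                      # the exception raised inside that future
end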
I hope this helps. In case of questions, please ask in comments, I shall modify this write up to answer those.
I am using the Ruby gem https://github.com/redis/redis-rb.
I want to use pipelining to send several Redis commands in one network trip to the Redis server. How can I do this if I have a loop?
For instance, would this work? Or would it simply send all the commands one by one?
cache = Redis.new # blah blah
normalized = cache.pipelined do
  urls.each do |url|
    key = "key:#{url}"
    cache.get(key)
    key2 = "key2:#{url}"
    cache.get(key2)
  end
end
The phrasing "one network trip" is a misunderstanding. All pipelined mode does is send in other commands while waiting on the results of the previous ones. This is in contrast to the default where each request blocks until completed.
If that Ruby library blocks then it will issue them sequentially, and I believe it blocks on anything that requires results. There are asynchronous libraries that do make much better use of the pipelined mode because it's easier to match results to variables in that model. It's also a lot more work.
Normally you use pipelined for doing multiple assignments, not retrieval. That way you don't need to wait for the result of an INCR to complete before moving to the next one, you can just fire-and-forget.
If you're looking to do quick retrievals, use MGET.
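For what it's worth, here is a minimal sketch of both suggestions using redis-rb, assuming the urls array from your question. The hits: key in the write example is hypothetical, and the block argument to pipelined assumes a reasonably recent redis-rb that yields the pipeline object:

require 'redis'

cache = Redis.new

# Retrieval: one MGET fetches every key in a single command.
keys = urls.flat_map { |url| ["key:#{url}", "key2:#{url}"] }
normalized = keys.zip(cache.mget(*keys)).to_h   # missing keys come back as nil

# Assignment: pipelined fire-and-forget writes, no waiting on each reply.
cache.pipelined do |pipeline|
  urls.each { |url| pipeline.incr("hits:#{url}") }   # "hits:" is a hypothetical key
end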
I'm rebuilding a forum/board in rails. One of the requirements is that view information be recorded for a subject.
In the current system, a database call is made every time the page is loaded updating the view count for that post.
I would like to avoid that and am looking at implementing redis to record that information using a technique similar to this post - jQuery Redis hit counter to track view of cached Rails pages
So I would make a request to a controller that would record the view - via javascript - and then a cron job would move the redis usage data to the database (removing it from redis).
My quandary is that the current system offers real-time usage information so that will be the expectation moving forward. Using Heroku - as I plan - the most frequent cron jobs would run hourly, which I don't think will be acceptable.
My thought was that I could store the usage information in redis and then while I'm looping through the subjects, I would combine the usage value stored in redis with the value that had been saved in the database from the cron job.
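To make that idea concrete, here is a rough sketch of what I mean with redis-rb (Subject, view_count, and the views: key naming are placeholders; the Redis keys are assumed to hold only the views accumulated since the last cron run):

require 'redis'

redis = Redis.new

Subject.all.each do |subject|
  # Views accumulated in Redis since the last cron run.
  pending_views = redis.get("views:#{subject.id}").to_i
  # Combine with the count the cron job already persisted to the database.
  total_views = subject.view_count.to_i + pending_views
  # ... render total_views for this subject ...
end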
Is this a dumb idea? I'm new to redis so I don't really know what is possible. Is it a huge no-no to do a redis call in a loop like I'm suggesting?
If you really need the old application to maintain real-time statistics, and want to use Redis, then you will have to change the legacy code to access it.
Here's a starting point for your code.
On every hit, you check the thread's counter in Redis. If the counter key doesn't exist, you load the count from the database first.
So this would be a way to keep the stats updated (using PHP with the phpredis client):
try {
    $redis = new \Redis();
    $redis->connect('127.0.0.1', 6379); // adjust host/port as needed

    $thread_id = getFromPostGet("thread_id");  //suppose so
    $key = 'ViewCounterKey:' . $thread_id;     //each thread has a counter key

    if (!$redis->exists($key)) {
        //counter not in Redis yet: seed it from the database
        $counter = getFromDB("count(*) where thread_id = $thread_id"); //suppose so
        //SETNX so a concurrent request that seeded the key first is not overwritten
        $redis->setnx($key, $counter);
    }

    $redis->incr($key); //every hit increments the counter
}
catch (\RedisException $e) {
    echo "Server down";
}
So this solution can be put together with your cron jobs, which would persist the view count, and the one-hour latency between cron runs would not matter, because you're always reading from memory (Redis, not the DB).
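Since your app is Rails rather than PHP, roughly the same thing with redis-rb would look like this. This assumes a reasonably recent redis-rb, and the Post.where(...).count query and the method name are placeholders for however you count views in your database:

require 'redis'

REDIS = Redis.new

# Call this from the controller action that records a view.
def record_view(thread_id)
  key = "ViewCounterKey:#{thread_id}"
  unless REDIS.exists?(key)
    # Seed the counter from the database the first time this thread is seen;
    # NX makes sure a concurrent request that already set it is not overwritten.
    persisted = Post.where(thread_id: thread_id).count # placeholder query
    REDIS.set(key, persisted, nx: true)
  end
  REDIS.incr(key)
end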
Hope that makes sense.
Currently I am running on Rails 3.1 RC4 and using Redis and Resque for queueing the creation of Rackspace servers.
The Rackspace gem that I am using, cloudservers, tells you when your server is done being set up via the status method.
What I am trying to do with the code below is execute the code in the elsif only after the server is active and ready to be used.
class ServerGenerator
  @queue = :servers_queue

  def self.perform(current_id)
    current_user = User.find(current_id)
    cs = CloudServers::Connection.new(:username => "***blocked for security***", :api_key => "***blocked for security***")
    image = cs.get_image(49)  # Set the linux distro
    flavor = cs.get_flavor(1) # Use the 256 MB of RAM instance
    newserver = cs.create_server(:name => "#{current_user.name}", :imageId => image.id, :flavorId => flavor.id)

    if newserver.status == "BUILD"
      newserver.refresh
    elsif newserver.status == "ACTIVE"
      # Do stuff here. I generated another server with a different, static name
      # so that I could see if it was working.
      cs = CloudServers::Connection.new(:username => "***blocked for security***", :api_key => "***blocked for security***")
      image = cs.get_image(49)
      flavor = cs.get_flavor(1)
      newserver = cs.create_server(:name => "working", :imageId => image.id, :flavorId => flavor.id)
    end
  end
end
When I ran the above, it only generated the first server, the one that uses current_user.name as its name. Would a loop around the if statement help? Also, this seems like a poor way of queueing tasks.
Should I enqueue a new task that just checks to see if the server is ready or not?
Thanks a bunch!
Based upon what you've written, I'm assuming that cs.create_server is non-blocking. In that case, yes, you would need to wrap your check in a loop (or some similar construct). Otherwise you're checking the value precisely once and then exiting the perform method.
If you're going to loop in the method, you should add in sleep calls, otherwise you're going to burn a lot of CPU cycles doing nothing. Whether to loop or call a separate job is ultimately up to you and whether your workers are mostly idle. Put another way, if it takes 5 min. for your server to come up, and you just loop, that worker is not going to be able to process any other jobs for 5 min. If that's acceptable, it's certainly the easiest thing to do. If not acceptable, you'll probably want another job that accepts your server ID and makes an API call to see if it's available.
That process itself can be tricky though. If your server never comes online for whatever reason, you could find yourself creating jobs waiting for its status ad infinitum. So, you probably want to pass some sort of execution count around too, or keep track in redis, so you stop trying after X number of tries. I'd also check out resque-scheduler so you can exert control over when your job gets executed in this case.
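For the separate-job route, here is a hedged sketch of what that could look like with resque-scheduler. ServerStatusChecker, MAX_ATTEMPTS, the 10-second delay, and cs.get_server are my own placeholders and assumptions, not part of your code or necessarily the gem's exact API:

class ServerStatusChecker
  @queue = :servers_queue
  MAX_ATTEMPTS = 60 # give up after roughly 10 minutes of polling

  def self.perform(server_id, attempt = 0)
    return if attempt >= MAX_ATTEMPTS # stop re-enqueueing so we don't poll forever

    cs = CloudServers::Connection.new(:username => "***", :api_key => "***")
    server = cs.get_server(server_id) # assumption: look the server up by id, adjust to the gem's API

    if server.status == "ACTIVE"
      # Server is ready: do the follow-up work here.
    else
      # Not ready yet: check again in 10 seconds via resque-scheduler.
      Resque.enqueue_in(10, ServerStatusChecker, server_id, attempt + 1)
    end
  end
end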