I'm making a small Rails application that fetches data about several different programming languages from the GitHub API.
The problem is that when I click the button that fetches the information, it takes a long time to redirect to the correct page. What I can see in the network tab is that the TTFB is actually 30s (!) and the response comes back with status 302.
The controller action that does the work:
Language.delete_all

search_urls = Introduction.all.map { |introduction| "https://api.github.com/search/repositories?q=#{introduction.name}&per_page=1" }

search_urls.each do |search_url|
  # Kernel#open comes from open-uri; each call blocks until GitHub responds
  json_file = JSON.parse(open(search_url).read)
  pl = Language.new
  pl.hash_response = json_file['items'].first
  pl.name = pl.hash_response['language']
  pl.save
end

main_languages = %w[ruby javascript python elixir java]
deletable_languages = Introduction.all.reject do |introduction|
  main_languages.include?(introduction.name)
end
deletable_languages.each do |language|
  language.delete
end

redirect_to languages_path
end
I believe the bottleneck is the HTTP requests, which you are making one by one. You could filter down to the languages you actually want before generating the URLs, and only then fetch them.
However, if the number of URLs after filtering is still large, say 20-50, then assuming each request takes 200ms this would take 4s to 10s just for the HTTP requests. That's already too long for the user to wait, so in that case you should move the work into a background job.
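For example, a minimal sketch with Sidekiq (the worker name here is made up, not taken from your app):

# app/workers/fetch_languages_worker.rb (hypothetical)
class FetchLanguagesWorker
  include Sidekiq::Worker

  def perform
    # move the Language.delete_all / GitHub fetching logic from the
    # controller action into this method
  end
end

# In the controller action: enqueue the job and redirect immediately.
FetchLanguagesWorker.perform_async
redirect_to languages_path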
If you insist on doing this synchronously, you may consider firing those HTTP requests from multiple threads and joining all the results once every thread has completed. You get some concurrency here, since the GIL does not block threads that are waiting on IO. But this is very error-prone, because you have to manage the threads yourself.
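A rough sketch of that thread-based approach, reusing the search_urls from your action and Net::HTTP from the standard library (untested, and error handling is omitted):

require 'net/http'
require 'uri'
require 'json'

# Fire one thread per URL; each thread fetches and parses its own response.
threads = search_urls.map do |url|
  Thread.new { JSON.parse(Net::HTTP.get(URI(url))) }
end

# Thread#value joins the thread and returns the block's result.
results = threads.map(&:value)

results.each do |json_file|
  pl = Language.new
  pl.hash_response = json_file['items'].first
  pl.name = pl.hash_response['language']
  pl.save
end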
So I'm building a website that calls a third-party API that can take from 20 seconds to 30 minutes to return a result. I can't know this duration in advance, so I need to poll it frequently to check whether the work is done (it returns "COMPLETE" and the result) or not (it returns "IN_PROGRESS"). Also, this API might be called many times by many users at the same time.
So I created a Sidekiq worker that checks the API every 5 seconds until it receives "COMPLETE", and only then does it end. But I've read that Sidekiq should only be doing short-lived jobs, and I'm struggling to get my head around how I should do this. I've also been trying to search for an answer, but I suspect I don't know the words to find what I'm looking for.
I'm sure there is a way I can tell my workers to call the API once and, if the result is "IN_PROGRESS", end, but make sure another worker will make another API call to check, and so on until the result is "COMPLETE".
Also, I guess this would be handy for distributing the load in case many users demand the use of said API, because fewer workers can do more of these short-lived jobs.
This is my worker, which I hope clarifies what I'm doing right now:
class ThingProgressWorker
  include Sidekiq::Worker

  def perform(id)
    @thing = Thing.find(id)
    @thing_api_call = ThingAPICall.new # This uses the ruby library of the API
    completed = false
    while completed == false
      result = @thing_api_call.get_result({ thing_job_name: @thing.job_name })
      if !result.include? "COMPLETED"
        completed = false
        sleep 5
      else
        completed = true
        @thing.status = "completed"
        @thing.save
        break
      end
    end
  end
end
So if the API takes ten minutes to go from "IN_PROGRESS" to "COMPLETED", this worker will be busy for that long, which I reckon is not advised at all.
I've been thinking about this for some hours now and can't work out how to make each API call its own job without having a worker busy until the API is done.
The only solution I've thought of so far is having a master worker that calls another worker for each API call, but then I'll still have a worker busy for as long as the API takes to send the result.
I'd appreciate any help or directions!
Thanks in advance
Try calling the worker with a delay. For example:
class ThingProgressWorker
  include Sidekiq::Worker

  def perform(id)
    @thing = Thing.find(id)
    @thing_api_call = ThingAPICall.new # This uses the ruby library of the API
    result = @thing_api_call.get_result({ thing_job_name: @thing.job_name })
    if !result.include? "COMPLETED"
      ThingProgressWorker.perform_in(1.minute, id)
    else
      @thing.status = "completed"
      @thing.save
    end
  end
end
This will add the job to the queue, but it will not run immediately; it will run after the delay you specify, so no worker sits busy in the meantime.
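To kick the whole thing off you enqueue the first check once, for example right after creating the record (the call site below is only illustrative):

thing = Thing.create!(job_name: some_job_name) # illustrative
# The first check runs as soon as a worker is free; later checks re-enqueue
# themselves with perform_in until the API finally reports "COMPLETED".
ThingProgressWorker.perform_async(thing.id)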
In Python I have the option of using a "poller" object which polls blocking sockets for waiting messages and unblocks after a specified number of milliseconds (in the case below, 1000, in the while True block):
import random
import socket
from datetime import datetime, timedelta
import zmq
# now open up all the sockets
context = zmq.Context()
outsub = context.socket(zmq.SUB)
outsub.bind("tcp://" + myip + ":" + str(args.outsubport))
outsub.setsockopt(zmq.SUBSCRIBE, b"")
inreq = context.socket(zmq.ROUTER)
inreq.bind("tcp://" + myip + ":" + str(args.inreqport))
outref = context.socket(zmq.ROUTER)
outref.bind("tcp://" + myip + ":" + str(args.outrefport))
req = context.socket(zmq.ROUTER)
req.bind("tcp://" + myip + ":" + str(args.reqport))
repub = context.socket(zmq.PUB)
repub.bind("tcp://" + myip + ":" + str(args.repubport))
# sort out the poller
poller = zmq.Poller()
poller.register(inreq, zmq.POLLIN)
poller.register(outsub, zmq.POLLIN)
poller.register(outref, zmq.POLLIN)
poller.register(req, zmq.POLLIN)
# UDP socket setup for broadcasting this server's address
cs = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
cs.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
cs.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
# housekeeping variables
pulsecheck = datetime.utcnow() + timedelta(seconds = 1)
alivelist = dict()
pulsetimeout = 5
while True:
    polls = dict(poller.poll(1000))
    if inreq in polls:
        msg = inreq.recv_multipart()
        if msg[1] == b"pulse": # handle pulse
            ansi("cyan", False, textout = " pulse" + "-" + msg[0].decode())
            if not msg[0] in alivelist.keys():
                handlechange(msg[0])
            alivelist[msg[0]] = datetime.utcnow() + timedelta(seconds = pulsetimeout)
    if outsub in polls:
        msgin = outsub.recv_multipart()[0]
        repub.send(msgin) # republish
        msg = unpacker(msgin)
        if isinstance(msg, dict):
            valu = msg.get("value")
            print(".", end = "", flush = True)
        else:
            ansi("green", False, textout = msg)
    if req in polls:
        msg = req.recv_multipart()
        valmsg = validate_request(msg)
        if not valmsg[0]:
            ansi("red", True); print(valmsg[1]); ansi()
        elif len(alivelist) > 0:
            targetnode = random.choice(list(alivelist.keys()))
            inreq.send_multipart([targetnode, packer(valmsg[1])])
            ansi("blue", True, textout = "sent to " + targetnode.decode())
        else:
            ansi("red", True, textout = "NO CONNECTED NODES TO SEND REQUEST TO")
    if outref in polls:
        msg = outref.recv_multipart()
        destinataire, correlid = msg[1].split(b"/")
        req.send_multipart([destinataire, correlid, msg[2]])
I want to implement something analogous in Elixir (or Erlang), but my preferred native library, chumak, doesn't seem to implement polling. How do I implement non-blocking receives in Erlang/Elixir, preferably using Chumak? I'll move to another Erlang ZeroMQ library if necessary. My socket pattern preference is: router sends, dealer receives.
EDIT
My use case is the following. I have a third party financial service which serves data based on requests, with answers coming asynchronously. So you can send multiple requests, and you'll get responses back after an unspecified period of time, and not necessarily in the same order you sent them.
So I need to connect this service into Erlang (actually Elixir) and ZeroMQ seems like a good fit. Multiple users connected (via Phoenix) to Erlang/Elixir will send requests, and I need to pass these on to this service.
The problem comes if there is an error in one of the requests, or the third party service has some kind of problem. I will be blocking-waiting for a response, and then unable to service new requests from Phoenix.
Basically I want to listen constantly for new requests, send them over, but if one request doesn't produce a response, I will have one-fewer responses than requests and that will lead to an eternal wait.
I understand that if I send requests separately, then the good ones will produce responses so I don't need to worry about blocking even if, over time, I get quite a big numerical difference between requests sent and responses received. Maybe the design idea is that I shouldn't worry about this? Or should I try to track one-for-one responses to requests and timeout the non-responses somehow? Is this a valid design pattern?
Is your system constantly connected to the asynchronous query resource, or are you making a new connection with each query?
Each situation has its own natural model in Erlang.
The case of: A single (or pool of) long-term connection(s)
Long-term connections that maintain a session with the resource (the way a connection with a database would work) are most naturally modelled as processes within your system that have the sole job of representing that external resource.
The requirements of that process are:
Translate the external resource's messages into internally meaningful messages (not just passing junk through -- don't let raw, external data invade your system unless it is totally opaque to you)
Keep track of timed out requests (and this may require something like polling, but it can be done more precisely with erlang:send_after/3)
This implies, of course, that the module that implements this process will need to speak the protocol of that resource. But if this is accomplished then there really isn't any need for a messaging broker like an MQ application.
This allows you to have that process be reactive and block on receive while the rest of your program goes off to do whatever it is going to do, without some arbitrary polling that will surely run you into the Evil Black Swamp of Scheduling Issues.
The case of: A new connection per query
If each query to the resource requires a new connection, the model is similar, but here you spawn a new process per query and it represents the query itself within your system. It blocks waiting for the response (with a timeout), and nothing else matters to it.
That is the easier model, actually, because then you don't have to scrub a list of past, possibly timed out requests that will never return, don't have to interact with a set of staged timeout messages sent via erlang:send_after/3, and you move your abstraction one step closer to the actual model of your problem.
You don't know when these queries will return, and that causes some potential confusion -- so modeling each actual query as a living thing is an optimal way to cut through the logical clutter.
Either way, model the problem naturally: as a concurrent, async system
In no case, however, do you want to actually do polling the way you would in Python or C or whatever. This is a concurrent problem, so modelling it as such will provide you a lot more logical freedom and is more likely to result in a correct solution that lacks corners that give rise to weird cases.
I'm querying the Amazon Product Advertising API with code like this:
products = asins.map do |asin|
  item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
  json = { asin: item.get_element('Item').get('ASIN'),
           manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
           model: item.get_element('ItemAttributes').get('Model') }
end
And I get a 503 error: "You are submitting requests too quickly. Please retry your requests at a slower rate."
I found out that they want 1 request per second.
What's the best way of doing it in my case?
Perhaps just decelerate by waiting a second between two iterations:
products = asins.map do |asin|
  sleep 1 # wait one second before doing the next API call
  item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
  {
    asin: item.get_element('Item').get('ASIN'),
    manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
    model: item.get_element('ItemAttributes').get('Model')
  }
end
Using sleep is for sure the first solution that comes to mind. In my opinion it's not an elegant one, because it's not really manageable. I would think of some queueing system to do the work, maybe Sidekiq with a self-triggering worker?
Some simplified code:
# some kind of queueing logic, to fetch asins
asin = AsinQueue.fetch

# trigger first worker
LookupWorker.perform_async(asin)

# and the worker itself:
class LookupWorker
  include Sidekiq::Worker

  def perform(asin)
    item = Amazon::Ecs.item_lookup(asin, response_group: :Large)
    # all the domain logic

    # queue next lookup
    next_asin = AsinQueue.fetch
    LookupWorker.perform_in(1.second, next_asin)
  end
end
ItemLookup supports batch requests. You can look up as many as 10 items at once.
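For example, a sketch using the amazon-ecs gem, assuming it passes a comma-separated ItemId list straight through to the ItemLookup operation (which accepts up to ten ASINs per call):

products = asins.each_slice(10).flat_map do |batch|
  sleep 1 # still stay under the 1-request-per-second limit between batches
  response = Amazon::Ecs.item_lookup(batch.join(','), response_group: :Large)
  response.items.map do |item|
    {
      asin: item.get('ASIN'),
      manufacturer: item.get_element('ItemAttributes').get('Manufacturer'),
      model: item.get_element('ItemAttributes').get('Model')
    }
  end
end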
I have an iPhone app where I want to show how many people are currently viewing an item.
I'm doing that by running this transaction when people enter a view (RubyMotion code below, but it functions exactly like the Firebase iOS SDK):
listing_reference[:listings][self.id][:viewing_amount].transaction do |data|
  data.value = data.value.to_i + 1
  FTransactionResult.successWithValue(data)
end
And when they exit the view:
listing_reference[:listings][self.id][:viewing_amount].transaction do |data|
  data.value = data.value.to_i - 1
  FTransactionResult.successWithValue(data)
end
It works fine most of the time, but sometimes things go wrong: the app crashes, people lose connectivity, or similar things.
I've been looking at "onDisconnect" to solve this - https://firebase.google.com/docs/reference/ios/firebasedatabase/interface_f_i_r_database_reference#method-detail - but from what I can see, there's no "onDisconnectRunTransaction".
How can I make sure that the viewing amount on the listing gets decremented no matter what?
A Firebase Database transaction runs as a compare-and-set operation: given the current value of a node, your code specifies the new value. This requires at least one round-trip between the client and server, which means that it is inherently unsuitable for onDisconnect() operations.
The onDisconnect() handler is instead a simple set() operation: you specify, when you attach the handler, what write operation you want to happen when the server detects that the client has disconnected (either cleanly or, as in your problem case, involuntarily).
The solution is (as is often the case with NoSQL databases) to use a data model that deals with the situation gracefully. In your case it seems most natural to not store the count of viewers, but instead the uid of each viewer:
itemViewers
  $itemId
    uid_1: true
    uid_2: true
    uid_3: true
Now you can get the number of viewers with a simple value listener:
ref.child('itemViewers').child(itemId).on('value', function(snapshot) {
  console.log(snapshot.numChildren());
});
And use the following onDisconnect() to clean up:
ref.child('itemViewers').child(itemId).child(authData.uid).onDisconnect().remove();
Both code snippets are in JavaScript syntax, because I only noticed you're using Swift after typing them.
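Translated roughly into the RubyMotion style of your question (a sketch only; it assumes your wrapper exposes the iOS SDK's setValue, removeValue and onDisconnectRemoveValue methods and that you have the viewer's uid at hand):

# When the user starts viewing the item:
viewer_ref = listing_reference[:itemViewers][self.id][viewer_uid]
viewer_ref.setValue(true)
# Let the server remove this entry automatically if the client disconnects.
viewer_ref.onDisconnectRemoveValue

# When the user leaves the view normally:
viewer_ref.removeValue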
I noticed that when I use Mechanize to fetch a site that is not responding, it just keeps waiting.
How can I overcome this problem?
There are a couple of ways to deal with it.
OpenURI and Net::HTTP have ways of passing in timeout values, which tell the underlying networking stack how long you are willing to wait. For instance, Mechanize lets you get at its settings when you initialize an instance, something like this:
mech = Mechanize.new { |agent|
  agent.open_timeout = 5
  agent.read_timeout = 5
}
It's all in the docs for Mechanize.new, but you'll have to view the source to see which instance variables you can get at.
Or you can use Ruby's timeout module:
require 'timeout'
status = Timeout::timeout(5) {
  # Something that should be interrupted if it takes too much time...
}
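For example, wrapping a single Mechanize request (reusing the mech instance from above; the URL is just a placeholder):

require 'timeout'

begin
  page = Timeout.timeout(5) do
    mech.get('http://example.com/very-slow-page') # placeholder URL
  end
rescue Timeout::Error
  # The site did not answer within 5 seconds: log, retry or give up here.
end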
There are two undocumented attributes, open_timeout and read_timeout, listed on this page: http://mechanize.rubyforge.org/mechanize/Mechanize.html. Try using them.
agent = Mechanize.new { |a| a.log = Logger.new("mech.log") }
agent.keep_alive = false
agent.open_timeout = 15
agent.read_timeout = 15
HTH