I'm looking to truncate and save JSON data from an external URL at a certain time (10pm) every day. The URL requires a login, which I have access to, but I'm not sure how best to implement this.
The JSON is structured as such:
{
"Region": [{"id":"1","region":"South"}],
"Agent": [{"id":"1","first_name":"Tim","last_name":"Jones"}]
}
I have created the models Region and Agent to store the data in, matching the fields above.
This is what I have so far in the Region model:
json = JSON.parse('http://....')
json['Region'].each do |data|
  Region.create(
    id: data['id'],
    region: data['region']
  )
end
My questions are:
Is this possible via the model, or do I have to place this in a controller?
How do I go about truncating and saving at a certain time?
To get the data from an external URL, you'll need to use an HTTP client such as Faraday or HTTParty.
Once you have your JSON data, you can parse and manipulate it however you want.
To automate it at 10pm every day, you need to schedule a background job, using something like ActiveJob or Sidekiq together with a scheduler (for example cron via the Whenever gem).
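As a rough sketch of the fetch-and-parse step, here is a version that uses only Ruby's standard library (Net::HTTP rather than Faraday or HTTParty). It assumes the login is HTTP Basic auth, which the question doesn't actually say; `fetch_json` and `rows_by_model` are made-up helper names, and the sample body just mirrors the structure from the question.

```ruby
require "json"
require "net/http"
require "uri"

# Fetch the raw JSON body from a login-protected URL.
# Assumption: the site accepts HTTP Basic auth; adapt this if it uses a
# session cookie or token instead.
def fetch_json(url, username, password)
  uri = URI(url)
  request = Net::HTTP::Get.new(uri)
  request.basic_auth(username, password)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.value # raises unless the response is a 2xx
  response.body
end

# Turn the payload into plain attribute hashes, keyed by model name.
def rows_by_model(body)
  payload = JSON.parse(body)
  {
    "Region" => payload.fetch("Region").map { |r| { id: r["id"].to_i, region: r["region"] } },
    "Agent" => payload.fetch("Agent").map do |a|
      { id: a["id"].to_i, first_name: a["first_name"], last_name: a["last_name"] }
    end
  }
end

# Sample body matching the structure in the question (no network needed).
sample = '{"Region":[{"id":"1","region":"South"}],' \
         '"Agent":[{"id":"1","first_name":"Tim","last_name":"Jones"}]}'
rows = rows_by_model(sample)
p rows["Region"]
p rows["Agent"]
```

Inside a Rails job or a class method on the model (rather than a controller), the truncate-and-save step could then be something like `Region.transaction { Region.delete_all; Region.create!(rows["Region"]) }`, so the table is never left empty if the import fails halfway.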
Related
From the Best Practices Guide to using Sidekiq, I understand it's best to pass "string, integer, float, boolean, null(nil), array and hash" as arguments to the job.
I often just pass the id of a persisted object to my jobs, but due to latency constraints I need to save the object after running the job.
The non-persisted object I'm working with contains a mixture of data types:
#MyObject<00x000>{
id: nil
start_time: Fri, 11 Dec 2020 08:45:00 PST -08:00 (*this is a TimeWithZone object)
rate: 18.0 (*this is a BigDecimal object)
...
}
I plan to pass this object to my job by converting it to a hash first:
MyJob.perform_async(my_object.attributes)
and then later persist the object like so:
MyObject.new(my_object_hash).save
My question is, is this safe? Even though I am passing a 'simple' datatype to Sidekiq, it actually contains complex objects. Am I going to lose precision?
Thank you!
This sounds like a "potayto, potahto" solution. You are not using Sidekiq's serialisation, but are instead serializing the object yourself.
Let's have a look at why sidekiq has this rule:
Even if they did serialize correctly, what happens if your queue backs up and that quote object changes in the meantime? [...]
Don't pass symbols, named parameters, keyword arguments or complex Ruby objects (like Date or Time!) as those will not survive the dump/load round trip correctly.
I like to add a third:
Serializing state makes it impossible to distinguish between persisted and ephemeral (in-memory, memoized, lazy-loaded etc.) data. E.g. a def sent_mails; @sent_mails ||= Mail.for(user_id: id); end now gets serialized: do you want that?
The solution is also provided by sidekiq:
Don't save state to Sidekiq, save simple identifiers. Look up the objects once you actually need them in your perform method.
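To see what the quoted warnings mean in practice, here is a small sketch that mimics the dump/load round trip with Ruby's standard JSON library. It does not use Sidekiq itself; it only assumes that job arguments travel as JSON, and the argument values are made up.

```ruby
require "json"
require "date"

# Mimic what happens to job arguments on their way through the queue:
# they are dumped to JSON and loaded back on the worker side.
def round_trip(args)
  JSON.parse(JSON.generate(args))
end

original = { status: :active, due_on: Date.new(2020, 12, 11), record_id: 42 }
restored = round_trip(original)

p restored.keys            # the symbol keys come back as strings
p restored["status"]       # the Symbol :active comes back as a String
p restored["due_on"].class # the Date comes back as a String
p restored["record_id"]    # the plain Integer identifier survives unchanged
```

Only the identifier comes out the other side as the same kind of value that went in, which is why looking the object up by id inside perform is the safe default.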
The XY problem here
Your real problem is not where or how to serialize state, because Sidekiq warns against serializing state regardless of where and how you do this.
The problem you need to solve is either how to store state somewhere where it can be stored properly, or how to avoid storing the state at all: not in Redis/Sidekiq, nor in the storage that is giving you problems.
Latency
Is your storage slow? Or is it actually a validation, a serialisation, or some side effect of storage that is slow?
Can you improve this by making it a two-step: insert the state and update/enrich/validate it async later? If you are using Rails, it won't help you here, or might even work against you, but a common model is to store objects in a special "queue" table or events queue; e.g. kafka is famous for this.
When e.g. storage happens over a slow network to a slow API, this is probably unsolvable, but when storage happens in a local database, there are decades of solutions to improve write performance that you can use, either inside your database or with some specialised queue for state storage (Sidekiq is not such a specialised storage queue), depending on the tech used to store. E.g. Linux will allow you to store through memory, making writes to disk really quick, but removing the guarantee that the data was really written to disk.
E.g. In a bookkeeping api, we would store the validated object in PostgreSQL and then have async jobs add expensive attributes to this later (e.g. state that had to be retrieved from legacy APIs or through complex calculations).
E.g. in a write-heavy GIS system, we would store objects into a "to_process_places" table, that was monitored by tooling which processes the Places. It all really depends on your domain, and requirements.
Not using state.
A common solution is not to make objects, but to use the actual payload sent by the customer. Just send the HTTP payload (in Rails, the params) along and leave it at that. Maybe merge in a header (like the request date) or filter out some data (header tokens or cookies).
If your controller can operate with this data, so can a delayed job. Instead of building objects in the controller, leave that to the delayed job. This can even result in really neat and lean controllers: all they do is (some authentication and authorization and then) call the proper job and pass it a sanitized params.
Obviously this requires trade-offs, like not being able to validate in sync and having to deliver such information over email, push notification, or a delayed response instead, depending on your requirements (e.g. a large CSV import could just email any validation issues, but a login request might need an immediate response if the login is invalid).
It also requires some thought: you probably don't want to send the Base64 encoded CSV along to sidekiq, but instead write the file to a (temp) storage and pass the filename/url along instead. This might sound obvious, because it is: file uploads are essentially an implementation of the earlier mentioned "temporary state storage": you don't pass the entire PDF/high-res-header-image/CSV along to sidekiq, but store it somewhere so sidekiq can pick it up later to process it. Why should the other attributes not employ the same pattern if passing them along to sidekiq is problematic?
The most important part from the best practices you linked is:
Complex Ruby objects do not convert to JSON
Therefore you're not supposed to pass instances of a model to a worker.
If you're using Sidekiq workers, you should comply with this statement, and the hash you're passing should be just fine. I am not exactly sure about the TimeWithZone object, but you could try converting it to JSON or to a string, as they do in the best practices guide.
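One way to read "convert it to a string" is to do the conversion explicitly on both sides of the queue. The sketch below uses a plain Ruby Time as a stand-in for TimeWithZone and made-up helper names (`serialize_attributes`, `deserialize_attributes`); it is only meant to show that an ISO 8601 string with nine fractional digits and a decimal string carry both values through JSON without loss.

```ruby
require "json"
require "time"
require "bigdecimal"

# Convert the awkward values to strings explicitly before enqueueing.
def serialize_attributes(start_time:, rate:)
  {
    "start_time" => start_time.iso8601(9), # keep nine fractional digits
    "rate" => rate.to_s                    # BigDecimal as a decimal string
  }
end

# Rebuild the original value types inside the job.
def deserialize_attributes(hash)
  {
    start_time: Time.iso8601(hash["start_time"]),
    rate: BigDecimal(hash["rate"])
  }
end

# Example values; the fractional second is made up for illustration.
start_time = Time.new(2020, 12, 11, 8, 45, Rational(151_893, 1_000_000), "-08:00")
rate = BigDecimal("18.0")

wire = JSON.generate(serialize_attributes(start_time: start_time, rate: rate))
restored = deserialize_attributes(JSON.parse(wire))

puts wire
puts restored[:start_time] == start_time
puts restored[:rate] == rate
```

The cost is that the job has to know which keys need converting back, so this works best when the set of attributes is small and stable.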
However, if you're using ActiveJob instead of Sidekiq workers (does your job inherit from ApplicationJob or does it include Sidekiq::Worker?), then you don't have that problem, because ActiveJob uses Global ID to convert objects into a String and deserializes the object again before performing the job. This means you can pass an object to your job, provided it has been persisted, since a Global ID is built from the record's id.
my_object = MyObject.find(1)
my_object.to_global_id #=> #<GlobalID:0x000045432da2344 [...] gid://your_app_name/MyObject/1>>
serialized_my_object = my_object.to_global_id.to_s
my_object = GlobalID.find(serialized_my_object)
You can find more information here
https://github.com/toptal/active-job-style-guide#active-record-models-as-arguments
After doing some experimentation on the Time objects in my job, I found that I am losing nanosecond precision at the other end of the job.
my_object.start_time
=> Mon, 21 Dec 2020 11:35:50 PST -08:00
my_object.start_time.strftime('%Y-%m-%d %H:%M:%S.%N')
=> "2020-12-21 11:35:50.151893000"
You can see here, we have precision including 6 digits after the decimal.
(see this answer for more about 'strftime')
Once we call JSON methods on the object:
generated = JSON.generate(my_object.attributes)
=> \"start_time\":\"2020-12-21T11:35:50.151-08:00\"
You can see here we are down to 3 digits of precision after the decimal. The remaining 3 digits are lost at this point.
parsed = JSON.parse(generated)
parsed['start_time'] # => "2020-12-21T11:35:50.151-08:00"
It appears at the most basic level, the JSON library recursively calls as_json on each of the key-value pairs in the hash. So really it depends on how your particular object implements as_json.
This issue caused test failures that involved querying our db for persisted objects (initialized with something like, start_time = Time.zone.now (!)) that are meant to overlap in time exactly with our MyObject class. Once the half-baked my_object blueprints made it through Sidekiq, they lost a sliver of precision, causing a slight misalignment.
One way to hack away at this issue is by monkey patching the Time class.
In our case, a better solution was to go in the opposite direction and to not use so much precision in our tests. The my_object in the example is something that a human user will have on their calendar; in production we never receive so much precision from clients. So instead we fixed our tests by instructing some of our test objects to use something like Time.zone.now.beginning_of_minute, rather than Time.zone.now. We intentionally removed precision to fix the issue, as well as more closely mirror reality.
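The effect of that fix can be sketched in plain Ruby. Here `beginning_of_minute` is a hand-written stand-in for the ActiveSupport method of the same name, and `millisecond_round_trip` only simulates a serializer that keeps three fractional digits; neither is the code we actually ran.

```ruby
require "time"

# Plain-Ruby stand-in for ActiveSupport's beginning_of_minute.
def beginning_of_minute(time)
  Time.new(time.year, time.month, time.day, time.hour, time.min, 0, time.utc_offset)
end

# Simulate a JSON round trip that keeps only millisecond precision.
def millisecond_round_trip(time)
  Time.iso8601(time.iso8601(3))
end

precise = Time.new(2020, 12, 21, 11, 35, Rational(50_151_893, 1_000_000), "-08:00")
coarse = beginning_of_minute(precise)

puts millisecond_round_trip(precise) == precise # false: the microseconds were dropped
puts millisecond_round_trip(coarse) == coarse   # true: there is nothing left to lose
```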
I have been supplied with a URL whose data is updated every 30 minutes, and I'm wondering if I can save the data to my database as it updates. I'm using Rails 4.2.0.
There are 10 URLs in all, each with a different unit number, which needs to be referenced in order to request the data for each unit.
URL structure
http://sitename/cgi-bin/site=1
JSON structure
{"status"=>"ok", "data"=>[{"2014-08-11 11:00:00"=>14.9},{"2014-08-11 11:30:00"=>15.1}]}
With your json response, it can be done with something like this:
require 'json'

# String representing your JSON response (valid JSON uses ":" rather than "=>")
json = JSON.parse('{"status":"ok","data":[{"2014-08-11 11:00:00":14.9},{"2014-08-11 11:30:00":15.1}]}')
json['data'].each do |element|
  element.each do |key, value|
    Model.create(date: key, number: value) # Model is the name of your model
  end
end
If I may suggest something, you could send the JSON as:
{"status": "ok", "data": [{"date": "2014-08-11 11:00:00", "number": 14.9}, {...}]}
So you can access data like: element['date'] and element['number']
I think you should go for cron jobs. Have a look at the Whenever gem, which provides a clear syntax for writing and deploying cron jobs; with it you can save your data into the database every 30 minutes.
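For reference, a Whenever schedule for this could look roughly like the fragment below, which would live in config/schedule.rb. `ReadingImporter.import_all` is a hypothetical class method standing in for whatever code loops over the 10 URLs and saves the readings.

```ruby
# config/schedule.rb (read by the Whenever gem, not run on its own)

# Run the import every 30 minutes through the Rails runner.
# ReadingImporter.import_all is a hypothetical method name.
every 30.minutes do
  runner "ReadingImporter.import_all"
end
```

After editing the file, running `whenever --update-crontab` writes the matching entries to the crontab.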
I have a class method (placed in /app/lib/) which performs some heavy calculations and sub-http requests until a result is received.
The result isn't too dynamic, and requested by multiple users accessing a specific view in the app.
So, I want to schedule a periodic run of the method (using cron and the Whenever gem), store the results somewhere on the server in JSON format and, on demand, just read the stored results into the view.
How can this be achieved? What would be the correct way of doing that?
What I currently have:
def heavyMethod
  response = {}
  # some calculations, eventually building the response
  File.open(File.expand_path('../../../tmp/cache/tests_queue.json', __FILE__), "w") do |f|
    f.write(response.to_json)
  end
end
and also a corresponding method to read this file.
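For completeness, the write/read pair could be sketched like this in plain Ruby. The helper names are made up, and a temporary directory stands in for the app's tmp/cache folder; the one addition over the code above is writing to a temp file and renaming it, so the view never reads a half-written file while cron is refreshing it.

```ruby
require "json"
require "tmpdir"
require "fileutils"

# Write the result to a temp file first, then rename it into place, so a
# reader never sees a half-written file.
def write_cached_result(path, result)
  FileUtils.mkdir_p(File.dirname(path))
  tmp_path = "#{path}.tmp"
  File.write(tmp_path, JSON.generate(result))
  File.rename(tmp_path, path)
end

# Return the cached result, or nil if the scheduled job has not run yet.
def read_cached_result(path)
  return nil unless File.exist?(path)

  JSON.parse(File.read(path))
end

# A temporary directory stands in for the app's tmp/cache folder.
cache_path = File.join(Dir.mktmpdir, "cache", "tests_queue.json")
p read_cached_result(cache_path) # nothing cached yet
write_cached_result(cache_path, { "queued" => 3, "running" => 1 })
p read_cached_result(cache_path)
```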
I searched but couldn't find an example of achieving this using Rails cache convention (and not some private code that I wrote), on data which isn't related with ActiveRecord.
Thanks!
Your solution should work fine, but using Rails.cache should be cleaner and a bit faster. The Rails guides provide enough information about Rails.cache and how to get it to work with memcached, so let me just summarize how I would use it in your case.
Heavy method
def heavyMethod
  response = {}
  # some calculations, eventually building the response
  Rails.cache.write("heavy_method_response", response)
end
Request
response = Rails.cache.fetch("heavy_method_response")
The only problem here is that when your server starts for the first time, the cache will be empty. The same applies if/when memcached restarts.
One advantage is that somewhere in the flow, the data you pass in is marshalled into storage and then unmarshalled on the way out, meaning you can pass in complex data structures and don't need to serialize to JSON manually.
Edit: memcached will clear your item if it runs out of memory. This will be very rare, since it uses an LRU (I think) algorithm to expire things, and I presume you will use this often.
To prevent this, you can:
set expires_in larger than your cron period;
change your fetch code to call heavy_method if the fetch misses (like Rails.cache.fetch("heavy_method_response") { heavy_method }), and change heavy_method to just return the object;
or use something like Redis, which with its default configuration does not evict items.
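The fetch-with-a-block behaviour suggested above can be illustrated without Rails. `TinyCache` below is a deliberately small in-memory stand-in written for this example, not Rails' implementation, and the `now:` argument exists only so that expiry can be shown without sleeping.

```ruby
# A tiny in-memory stand-in for Rails.cache, only to illustrate the semantics
# of fetch-with-a-block and expires_in.
class TinyCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @entries = {}
  end

  # Return the cached value, or run the block, store its result, and return it.
  def fetch(key, expires_in:, now: Time.now)
    entry = @entries[key]
    return entry.value if entry && entry.expires_at > now

    value = yield
    @entries[key] = Entry.new(value, now + expires_in)
    value
  end

  # Drop an entry, as an eviction or a restart would.
  def delete(key)
    @entries.delete(key)
  end
end

calls = 0
heavy_method = lambda do
  calls += 1
  { "queued" => 3 }
end

cache = TinyCache.new
cache.fetch("heavy_method_response", expires_in: 3600) { heavy_method.call } # miss: computes
cache.fetch("heavy_method_response", expires_in: 3600) { heavy_method.call } # hit: block skipped
cache.delete("heavy_method_response")                                        # simulate an eviction
cache.fetch("heavy_method_response", expires_in: 3600) { heavy_method.call } # miss: recomputes
puts calls # 2
```

With the real Rails.cache the call shape is the same: the block only runs on a miss, so an eviction or a cold start costs one slow request instead of returning nothing.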
I'm calling an API method from a server:
def get_data
  #.........
  get_some_data_from_server
end
I want to cache the result of this call, obviously. So I created a field in a table and changed the get_data to look like this:
def fetch_data
  key = get_cache_key
  Rails.cache.fetch key, expires_in: 500.minutes do
    get_some_data_from_server
  end
end
The result of get_some_data_from_server doesn't change frequently; it's pretty much the same all the time. But it may change over time, even before the 500 minutes have passed, and thus the user may receive outdated data from the cache.
Is this strategy sensible? What do I do about get_some_data_from_server changing over time?
If it has changed, then the key should change. Ideally, you want your key to reflect, in some part, something unique about the data it represents.
Take the request itself, for example: if you were making a request that included two dates, to receive data between two points, the key would include those two dates, so every time you request data for a different range of dates, it won't use the same cache key.
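As a minimal sketch of a key that includes the two dates, assuming the request is parameterised by a start and end date (`range_cache_key` and the "report" prefix are made up for illustration):

```ruby
require "date"

# Build a cache key that embeds the request parameters, so each date range
# gets its own cache entry. The "report" prefix is a made-up namespace.
def range_cache_key(from, to)
  "report/#{from.iso8601}/#{to.iso8601}"
end

puts range_cache_key(Date.new(2020, 1, 1), Date.new(2020, 1, 31))
# report/2020-01-01/2020-01-31
```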
If the request will always be the same, there should be a way to determine whether the API data has changed, perhaps through a less expensive API call. If this call says that new data is available, then you should clear the cache at that key and request the new data.
One way to check whether the data has changed before you make a request is to use ETags and conditional GETs. A detailed description can be found here: http://fideloper.com/api-etag-conditional-get
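A conditional GET could be sketched with Net::HTTP along these lines. It assumes the API actually sends an ETag header and honours If-None-Match, which not every server does; `build_conditional_request` and `fetch_if_changed` are made-up helper names.

```ruby
require "net/http"
require "uri"

# Build a GET request that asks the server to skip the body when the copy
# we already hold (identified by its ETag) is still current.
def build_conditional_request(uri, etag)
  request = Net::HTTP::Get.new(uri)
  request["If-None-Match"] = etag if etag
  request
end

# Returns [etag, body] when the server sends fresh data, or nil when it
# answers 304 Not Modified.
def fetch_if_changed(url, etag)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(build_conditional_request(uri, etag))
  end
  return nil if response.is_a?(Net::HTTPNotModified)

  response.value # raises for any other non-2xx response
  [response["ETag"], response.body]
end
```

When `fetch_if_changed` returns nil, the cached value can be kept as it is; otherwise the new body and ETag would replace the cached ones.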
If the result of get_some_data_from_server has changed, then call
Rails.cache.delete(get_cache_key)
https://stackoverflow.com/a/19603852/1289704
I am attempting to get all the orders from a Magento instance. Once a day we grab all the orders (sometimes a few thousand).
Extra stuff that's more why I ask:
I'm using ruby-on-rails to grab the orders. This involves sending the soap call to the magento instance. It's easy as.
Once I have the response, I convert it into a Hash (a tree), pick out the increment ids of the orders, and then call getOrder with each increment id.
I have two problems with what's going on now, one operational, and one religious.
Grabbing the XML response to the list request takes a really long time, and when you tack on the work involved in converting the XML to a hash, I'm seeing a really slow process.
The religious bit is that I just want the increment_ids, so why do I have to pay for the processing/bandwidth to support a hugely bloated response?
Ok so the question...
Is there a way to set the response returned from Magento, to include only specific fields? Only the updated_at and the increment_id for instance.
If not, is there another call I'm not aware of, that can get just the increment_ids and date?
Edit
Below is an example of what I'm looking for from magento but it's for ebay. I send this xml up to ebay, and get back a really really specific bit of info about the product. It works for orders and such too. I can say "only this" and get just that. I want the same from Magento
<GetItemRequest xmlns="urn:ebay:apis:eBLBaseComponents">
<SKU>b123-332</SKU><OutputSelector>ItemId</OutputSelector>
</GetItemRequest>
I've created a rubygem that gives you your salesOrderList response in the form of a hash, and you can do what you want with the orders after you've received them back (i.e. select the fields you want including increment_id). Just run
gem install magento_api_wrapper
To do what you want to do, you would do something like this:
api = MagentoApiWrapper::Sales.new(magento_url: "yourmagentostore.com/index.php", magento_username: "soap_api_username", magento_api_key: "userkey123")
orders = api.order_list(simple_filters: [{key: "status", value: "complete"}])
orders.map {|o| [o.increment_id, o.items.first.sku] }
Rough guess, but you get the idea. You would get the array of hashes back and you can do what you want with them after that. Good luck!