Is there an ATOM Client or framework that enables capture of a feed entry EXACTLY once?
If not, what is the best architecture?
We would like to parse an ATOM feed, persisting all syndication feed entries locally as individual records (in a database or file system). These records may have to be deleted periodically for efficiency, so the client must keep track of which entries it has already looked at, independently of that persistence.
Have you looked at Superfeedr? It's a Software-as-a-Service platform that does just that: it fetches feeds, parses them, and sends the new entries to your endpoints when they're available.
Answering my own question, based on the working solution I developed. In one word, the architectural solution to capturing only new and unique entries from a syndication feed is CACHING.
Specifically, the entries must be stored by the client to support the logic "does the feed have anything new for me?". I believe there is no shortcut to this "client-side" solution.
Conditional GET is not a complete solution, even if it is supported server-side by the syndicated feed. For instance, if the client does not send the exact If-Modified-Since timestamp, the server can ignore the header and simply generate all entries again. Per Chris Berry and Bryon Jacob (updated 10/27/08):
...a better way to do this is to use the start-index parameter, where the client sets this value to the end-Index, returned in the previous page. Using start-index ensures that the client will never see the same response twice, or miss Entries, since several Entries could have the same "update date", but will never have the same "update index".
In short, there is no standard server-side solution guaranteeing "new/unique". The very idea of "uniqueness" is a client-side concern anyway, and the server may not share the same opinion; from that perspective, it would be impossible for the server to satisfy all clients. In any case, the question is not about developing a better syndication server but a smarter client, and therefore caching is the way to go.
The cache implementation must persist between feed polls, and the time-to-live (TTL) and time-to-idle (TTI) properties of entries stored in the cache must be set appropriately, BOTH to limit the size and performance cost of the cache AND to adequately cover the feed's oldest entries between polling cycles. The cache could be memory-resident, a database, a file system, or a network array. A product like EHCache (ehcache.org) provides just about all the functionality needed.
The feed entries may be persisted as-is, but the best (most efficient) method is to identify the contents, or combinations thereof, that make an entry unique. Techniques like Java serialization or Google's Protocol Buffers can be used to create compact unique keys to persist in the cache. In my simple solution, I did not even bother storing the entries, just the keys, generated as an MD5 hash of a couple of entry fields by which I defined how an entry would be unique for my purpose.
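A minimal sketch of that key-generation idea (shown here in Ruby for illustration; my actual solution used Java with EHCache, and the entry fields used for the hash are just examples):

    require 'digest'
    require 'set'

    # Stand-in for the persistent cache; in practice this would be EHCache, Redis,
    # a DB table, etc., with TTL/TTI long enough to cover the oldest entry still
    # appearing in the feed between polls.
    seen_keys = Set.new

    # Compact unique key built from whichever fields define "uniqueness" for you.
    def entry_key(entry)
      Digest::MD5.hexdigest("#{entry[:id]}|#{entry[:updated]}")
    end

    # Return only the entries not seen before, and remember their keys.
    def new_entries(entries, seen_keys)
      entries.reject { |e| seen_keys.include?(entry_key(e)) }
             .each   { |e| seen_keys << entry_key(e) }
    end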
Hope this flowchart is helpful.
Related
For my SPA I have a series of Lookup entities that are loaded into my entity manager on page load for various pick lists and lookup values. What I'd like to do is store these entities in local storage and import them into my manager instead of requesting them over the network.
These lookups can be edited by 3 people in my company. What I'm trying to figure out is how to version these lookups in local storage so that the stored copy can be updated when a lookup changes (or at least give the client side a way to determine when the records are stale and request new ones). How can I achieve this? My lookups are simply tables in my overall database, and I don't see a way for the client side to recognize when the lookups have changed.
I'm reluctant to add a timestamp column because I would need to evaluate the entities in local storage, compare them to the ones in the database, and fetch only the ones I need. I'm not sure how I would save page-load time that way.
I'm considering moving all of my lookups into a separate database and versioning the whole thing, requesting new lookups when any one of them changes. I would need to write a mechanism for versioning this DB whenever one of the 3 people makes an edit.
Has anyone found a better solution to a problem of this type? My lookups() function is eating up most of the wait time on users' first access.
Consider maintaining a separate version file or web API endpoint.
Invalidate lookups by type or as a whole rather than individually.
Bump the version number(s) when anything changes. Stash version number with your local copy. Compare and reload as needed.
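A hedged sketch of what that version endpoint could look like on a Rails-style backend (your stack may differ; the controller, model, and lookup type names are assumptions):

    # One row per lookup type, each holding an integer version (column defaults to 0).
    class LookupVersion < ActiveRecord::Base
      def self.current(type)
        find_or_create_by(name: type.to_s).version
      end

      # Called from wherever the three editors save their changes.
      def self.bump(type)
        find_or_create_by(name: type.to_s).increment!(:version)
      end
    end

    # Endpoint the SPA hits on page load; it compares these numbers with the ones
    # stashed next to its local-storage copy and reloads only the stale types.
    class LookupVersionsController < ApplicationController
      def index
        render json: %w[colors statuses countries].map { |t| [t, LookupVersion.current(t)] }.to_h
      end
    end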
I have already read Rails - How do I temporarily store a rails model instance? and similar questions but I cannot find a successful answer.
Imagine I have the model Customer, which may contain a huge amount of information attached (simple attributes, data in other tables through has_many relations, etc.). I want the application's user to access all of that data on a single page with a single Save button. As the user makes changes to the data (changing simple attributes, adding or deleting has_many items, ...), I want the application to update the model, but without committing changes to the database. Only when the user clicks Save should the model be committed.
To achieve this I need the model to be kept by Rails between HTTP requests. Furthermore, two different users may be changing the model's data at the same time, so these temporary instances should be bound to the Rails session.
Is there any way to achieve this? Is it actually a good idea? And, if not, how can one design a web application in which changes to a model are retained not in the browser but on the server until the user wants to commit them?
EDIT
Based on user smallbutton.com's proposal, I wonder if serializing the model instance to a temporary file (whose path would be stored in the session hash), and then reloading it each time a new request arrives, would do the trick. Would it work in all cases? Is there any piece of information that would be lost during serialization/deserialization?
As HTTP requests are stateless, you need some kind of storage between requests. The session is the easiest way to store data between requests, but in your case the session will not be enough because the data needs to be accessible to multiple users.
I see two ways to achieve your goal:
1) Get some fast external data storage like a key-value server (Redis, or anything you prefer from http://nosql-database.org/) where you put your objects via serializing/deserializing (e.g. JSON).
This may be fast depending on your design choices and data model but this is the harder approach.
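A minimal sketch of option 1, assuming Redis and JSON serialization (the key scheme and expiry are assumptions, Customer is the model from the question; a real version would also need locking, see the note further down):

    require 'redis'
    require 'json'

    redis = Redis.new

    # Stash the unsaved attributes under a key tied to the record being edited.
    def store_draft(redis, customer_id, attributes)
      redis.set("customer_draft:#{customer_id}", JSON.generate(attributes), ex: 3600) # drop abandoned drafts
    end

    # On the next request, rebuild an in-memory, unsaved model from the draft.
    def load_draft(redis, customer_id)
      raw = redis.get("customer_draft:#{customer_id}")
      raw && Customer.new(JSON.parse(raw))
    end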
2) Just store your objects in the DB as you regularly would and have them versioned (https://github.com/airblade/paper_trail). Then you can store a timestamp when people hit the Save button, and you can always go back to that state. This would be the easier approach, I guess, but it may be a bit slower depending on the size of your data model changes (though I think it'll do).
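Option 2 could look roughly like this with paper_trail (a sketch; exact API details vary between paper_trail versions):

    # Track every change to Customer rows; paper_trail writes them to a versions table.
    class Customer < ActiveRecord::Base
      has_paper_trail
    end

    # If the user abandons their edits instead of clicking Save, reify the last
    # version recorded before they started editing and write it back.
    def rollback_to(customer, editing_started_at)
      version = customer.versions.where("created_at <= ?", editing_started_at).last
      version&.reify&.save!
    end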
EDIT: If you need real-time collaboration between users you should probably have a look at something like Firebase
EDIT2: Answer to your second question, whether you can put the data into a file:
Sure, you can do that. But you would need some kind of locking to prevent data loss if more than one person is editing. You will need that as well if you go for 1), but tools like Redis already include locks to achieve your goal (e.g. redis-semaphore). Depending on your data you may need to build some logic for merging different changes from different users.
3) Another approach that came to my mind would be to do all editing with JavaScript and save it in one DB transaction. This would go well with synchronization tools like Firebase (or your own synchronization via the Rails streaming API).
I am looking for a solution for logging data changes for a public API.
The client app needs to be told which tables from the DB have changed and need to be synchronised since the app last synchronised, and the answer also needs to be specific to a brand and country.
Current Solution:
A Version table holds the class_names of models and is touched from every model on create, delete, touch, and save actions.
When we touch the version for a specific model we also look at the reflected associations and touch them too.
The Version model is scoped to brand and country.
The REST API responds to a request that includes last_sync_at (timestamp), brand, and country.
Rails looks at Version with the given attributes and returns the class_names of models which changed since the last_sync_at timestamp.
This solution works, but the problem is performance, and it is also hard to maintain.
UPDATE 1:
Maybe the simpler question is:
What is the best practice for finding out, and telling frontend apps, when and what needs to be synchronized, in terms of the overall concept?
Conditions:
Frontend apps need to download only their own content changes, not the whole dataset.
Synchronization must not be triggered for an app when an application from a different country or brand needs to be synchronized.
Thank you.
I think that the best solution would be to use Redis (or some other key-value store) and save your information there. Writing to Redis is much faster than to any SQL DB. You can write a service class that saves the data like:
RegisterTableUpdate.set(table_name, country_id, brand_id, timestamp)
Such a call would save the given timestamp under a key that could look like, e.g., table-update-1-1-users, where the first number is the country id, the second number is the brand id, and they are followed by the table name (or you could use country and brand names if needed). If you want to find out which tables have changed, you just need to find the Redis keys matching "table-update-1-1-*", iterate through them, and check which ones are newer than the timestamp sent through the API.
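A sketch of what that hypothetical service class might look like with the redis gem (the key layout matches the description above; KEYS is used for brevity and you would likely switch to SCAN in production):

    require 'redis'

    class RegisterTableUpdate
      REDIS = Redis.new

      # Record "table X changed for this country/brand at time T".
      def self.set(table_name, country_id, brand_id, timestamp = Time.now)
        REDIS.set("table-update-#{country_id}-#{brand_id}-#{table_name}", timestamp.to_i)
      end

      # Answer "which tables changed since last_sync_at for this country/brand?".
      def self.changed_since(country_id, brand_id, last_sync_at)
        REDIS.keys("table-update-#{country_id}-#{brand_id}-*")
             .select { |key| REDIS.get(key).to_i > last_sync_at.to_i }
             .map    { |key| key.split('-').last } # assumes table names without dashes
      end
    end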
It is worth remembering that Redis is not as reliable as SQL databases; its reliability depends on configuration, so you might want to read the Redis guidelines and decide whether you would like to go for it.
You can take advantage of the fact that ActiveRecord automatically records every time it updates a table row (the updated_at column).
When checking what needs to be updated, select the objects you are interested in and compare their updated_at with the timestamp from the client app.
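In a Rails controller that could be as simple as something like the following (model, scoping columns, and parameter names are placeholders):

    # Return only the records touched since the client's last sync, relying on
    # the updated_at column Rails maintains automatically.
    def changes
      last_sync_at = Time.zone.parse(params[:last_sync_at])
      records = Product.where(country: params[:country], brand: params[:brand])
                       .where("updated_at > ?", last_sync_at)
      render json: records
    end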
The advantage of this approach is that you don't need to keep an additional table that lists all the updates on models, which should speed things up for the API users and be easier to maintain.
The disadvantage is that you cannot see the changes in data over time; you only know that a change occurred and you can access the latest version. If you need to track changes in data over time efficiently, then I'm afraid you'll have to rework things from the top.
(read last part - this is what you are interested in)
I would recommend that you use the decorator design pattern for changing the client queries. So the client sends a query of what he wants and the server decides what to give him based on the client's last update.
so:
the client sends a query that includes the time it last synced
the server sees the query and takes into account the client's nature (device, country)
the server decorates (changes accordingly) the query so that it requests only the relevant data from the DB, and if that is not possible:
after the data are returned from the database manager, they are trimmed down to what is relevant to where they are going
the server returns to the client all the new stuff that the client cares about
I assume that you have a time entered field on your DB entries.
In that case the "decoration" of the query (abstractly) would be just to add something like a "WHERE" clause in your query and state you want data entered after the last update.
Finally, if you want this to be done for many devices/locales/whatever, implement a decorator for the query and one for the result of the query, and serve them to your clients as they should be served. (Keep in mind that, in contrast with a subclassing approach, you will only have to implement one decorator for each device/locale/whatever, not one for every combination!)
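A minimal sketch of that decorator idea over an ActiveRecord-style relation (class and column names are made up for illustration):

    # Base query: whatever the client asked for.
    class ProductQuery
      def relation
        Product.all
      end
    end

    # Decorator: wraps any query object and narrows it to changes since the last sync.
    class SinceLastSyncDecorator
      def initialize(query, last_sync_at)
        @query, @last_sync_at = query, last_sync_at
      end

      def relation
        @query.relation.where("updated_at > ?", @last_sync_at)
      end
    end

    # Decorators stack, so country/brand scoping can be another decorator:
    # SinceLastSyncDecorator.new(CountryDecorator.new(ProductQuery.new, "US"), last_sync_at)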
Hope this helped!
Let’s say I have a weather web service that I’m hitting (consuming) every page load. Not very efficient or smart and probably going to exceed my API limit or make the webservice owners mad. So instead of fetching directly from a controller action, I have a helper / job / method (some layer) that has the chance to cache the data a little bit. Let’s also say that I don’t care too much about the real-time-ness of the data.
Now what I’ve done in the past is simply store the attributes from the weather service in a table and refresh the data every so often. For example, the weather service might look like this:
Weather for 90210 (example primary key)
-----------------------------
Zip Name: Beverly Hills
Current Temperature: 90
Last Temp: 89
Humidity: 0
... etc.
So in this case, I would create columns for each attribute and store them when I fetch from the webservice. I could have an expiring rails action (page caching) to do the refresh or I could do a background job.
This simple approach works well except if the webservice has a large list of attributes (say 1000). Now I’m spending a lot of time creating and maintaining DB columns repeating someone else’s attributes that already exist. What would be great is if I could simply cache the whole response and refer to it as a simple Hash when I need it. Then I’d have all the attributes cached that the webservice offers for “free” because all the capabilities of the web service would be in my Hash instead of just caching a subset.
To do this, I could maybe fetch the webservice response, serialize it (YAML maybe) and then fetch the serialized object if it exists. Meh, not great. Serializing can get weird with special characters. It’d be really cool if I could just follow a memcached type model but I don’t think you can store complex objects in memcached right? I'd also like to limit the amount of software introduced, so a stand-alone proxy layer would be suboptimal imo.
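Roughly, what I'm imagining is something like this (a hypothetical sketch assuming the service returns JSON and Rails.cache is backed by memcached or similar, which can hold any Marshal-able object such as a Hash; the URL is made up):

    require 'net/http'
    require 'json'

    # Cache the whole parsed response and let it expire on its own, so every
    # attribute the service offers is available "for free".
    def weather_for(zip)
      Rails.cache.fetch("weather/#{zip}", expires_in: 30.minutes) do
        JSON.parse(Net::HTTP.get(URI("https://api.example.com/weather/#{zip}"))) # hypothetical endpoint
      end
    end

    weather_for("90210")["Current Temperature"]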
Anyone done something similar or have a name for this?
If the API you're hitting is RESTful and respects caching, don't reinvent the wheel. HTTP has caching built into it (see RFC 2616), so try to use it as far as possible. You have two options:
Just stick a squid proxy between your app and the API and you're done.
Use Wrest - we wrote it to support RFC 2616 caching, and it's the only Ruby HTTP wrapper that I know of that does.
If the API doesn't respect caching (most do), then the other advice you've received makes sense. What you actually use to hold your cache (MongoDB/memcached/whatever) depends on a bunch of other factors, so really, that depends on your situation.
You can use MongoDB (or another JSON datastore), get the results of the API in JSON, and store them in your Mongo collection. Then read the data and attributes that you care about and ignore the rest.
For your weather API call, you can check whether that city already exists in your Mongo collection and, if not, fetch it via the API (and then store it in Mongo).
It would be a modification of the Rails.cache pattern.
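Something along these lines with the mongo Ruby driver (collection name, staleness handling, and the endpoint are assumptions):

    require 'mongo'
    require 'net/http'
    require 'json'

    client  = Mongo::Client.new(['127.0.0.1:27017'], database: 'weather_cache')
    weather = client[:weather]

    # Serve from the collection when present; otherwise hit the API and store the
    # whole JSON document, attributes and all.
    def weather_for(collection, zip)
      cached = collection.find(zip: zip).first
      return cached if cached # could also compare a fetched_at field to expire stale docs

      doc = JSON.parse(Net::HTTP.get(URI("https://api.example.com/weather/#{zip}"))) # hypothetical endpoint
      collection.insert_one(doc.merge('zip' => zip, 'fetched_at' => Time.now))
      doc
    end

    weather_for(weather, "90210")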
I'm going to attempt to create an open project which compares the most common MP3 download providers.
This will require a user to enter a track/album/artist name, e.g. Deadmau5; this will then pull the relevant prices from the APIs.
I have a few questions that some of you may have encountered before:
Should I have one server-side page that requests all the data so that it is all loaded simultaneously? If so, how would you deal with timeouts or any other problems that may arise? Or should the page load first and each price then get pulled in one by one (Ajax)? What are your experiences when running a comparison check?
The main feature will be to compare prices, but how can I be sure that the products are the same? I was thinking of running time and track numbers, but I would still have to set one source as my primary.
I'm making this a wiki, please add and edit any issues that you can think of.
Thanks for your help. Look out for a future blog!
I would check Amazon first. They will give you a SKU (the barcode on the back of the album; I think Amazon calls it an EAN). If the other providers use this, you can make sure they are looking at the same item.
I would cache all results into a database, and expire them after a reasonable time. This way when you get 100 requests for Britney Spears, you don't have to hammer the other sites and slow down your application.
You should also make sure you are parallelizing whatever requests you are doing server-side. Curl, for instance, allows you to pull multiple URLs and assign a user-defined callback. I'd have the callback send some data so you can update your page as the results come back: GETTUNES => the curl callback returns some data for each URL while the connection is open, and you parse it on the client side.
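In Ruby, the same idea could be sketched with Typhoeus (a libcurl wrapper) and per-request callbacks; the provider URLs and the push helper are placeholders:

    require 'typhoeus'

    # Hypothetical hook: in the real app this might write to a cache the page polls,
    # or push over a websocket; here it just prints the result.
    def push_price_to_client(url, body)
      puts "#{url} => #{body[0, 60]}"
    end

    hydra = Typhoeus::Hydra.new

    %w[
      https://provider-one.example.com/price?track=Deadmau5
      https://provider-two.example.com/price?track=Deadmau5
    ].each do |url|
      request = Typhoeus::Request.new(url, timeout: 5) # don't let one slow provider block the page
      request.on_complete do |response|
        push_price_to_client(url, response.body) if response.success?
      end
      hydra.queue(request)
    end

    hydra.run # runs all queued requests in parallel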