Differences in Umbraco cache structure?

OK, so I have just spent the last 6-8 weeks in the weeds of Umbraco and have made some fixes/improvements to our site and environments. I have spent a lot of that time trying to correct lower-level Umbraco caching issues. Reflecting on my experience now, I still don't have a clue what the conceptual differences are between the following:
Examine indexes
umbraco.config
cached XML file in memory (supposedly similar to umbraco.config)
cmsContentXml table
Thanks again,
Devin

Examine indexes are indexes of Umbraco content.
Whenever you create, update, or delete content, the current content information is indexed.
These indexes are used for searching - under the hood, they are Lucene indexes.
The Umbraco back office uses these indexes for its own searching.
You can create your own indexes if you want.
For more info, check out the overview and explanation in "Examining Examine" by Peter Gregory.
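For example, here is a minimal sketch of querying one of the stock indexes through the Examine API (Umbraco 7 era). The "ExternalSearcher" name and the "nodeName" field are the shipped defaults; adjust them to your own setup:

// Query a stock Examine searcher (Lucene under the hood).
using Examine;

public class ContentSearch
{
    public void FindByName(string term)
    {
        var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
        var criteria = searcher.CreateSearchCriteria();
        var query = criteria.Field("nodeName", term).Compile();

        foreach (var result in searcher.Search(query))
        {
            // result.Id is the Umbraco node id; result.Score is Lucene's relevance.
            System.Console.WriteLine("{0} (score {1})", result.Id, result.Score);
        }
    }
}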
umbraco.config and the cached XML in memory are really the same thing.
The front-end UmbracoHelper API gets content from the cache, not the database - and the cache is loaded from umbraco.config.
The cmsContentXml table contains each content item's information as XML,
so essentially this XML represents all the information of a content node.
So in a nutshell, they represent three things:
Examine is used for searching
umbraco.config is cached data - it saves a round trip to the database
cmsContentXml stores the full information of a content item
Edit: adding clarification from Robert Foster's comment, and covering UmbracoHelper vs ExamineManager.
For the umbraco.config file and the cmsContentXml table, @robert-foster commented:
"umbraco.config stores the most recent version of all published content only; the in-memory cache is a cached version of this file; and the cmsContentXml table stores a representation of all content and is used primarily for preview mode - it is updated every time a content item is saved. IIRC it also stores a representation of other content types."
Regarding UmbracoHelper vs ExamineManager:
The UmbracoHelper API mostly gets its content from the in-memory cache. IMO it works best for locating content directly, i.e. when you already know the id of the content you want - you just call Umbraco.TypedContent(id).
But where do you get that id in the first place? Put another way, if you want to find all content whose Title property contains the word "Test", you would use Examine to search for it. Because Examine is really a Lucene wrapper, it is fast and efficient.
You could instead traverse the tree with methods such as Umbraco.TypedContent(id).Children and then use LINQ to filter the results, but that is done in memory with LINQ-to-Objects, so it is not as efficient and performant as Lucene.
So personally I think:
use Examine when you are searching for (locating) content - you get the capability of a proper search engine, Lucene;
once you have the ids from the search results, use UmbracoHelper to get the full published-content representation of each id as a strongly typed model and work with the data (see the sketch below);
one thing @robert-foster mentioned in the comments which I did not know: UmbracoHelper provides a Search method which is a wrapper around Examine, so use that if you are more familiar with that API.
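To make that concrete, here is a sketch of the two-step pattern (Umbraco 7 era APIs; the "title" field and the class and method names are illustrative):

// Search-then-hydrate: Examine locates ids, UmbracoHelper loads full models.
using System.Collections.Generic;
using System.Linq;
using Examine;
using Umbraco.Core.Models;
using Umbraco.Web;

public static class SiteSearch
{
    public static IEnumerable<IPublishedContent> Find(UmbracoHelper umbraco, string term)
    {
        // Step 1: let Lucene do the locating.
        var searcher = ExamineManager.Instance.SearchProviderCollection["ExternalSearcher"];
        var criteria = searcher.CreateSearchCriteria();
        var results = searcher.Search(criteria.Field("title", term).Compile());

        // Step 2: hydrate each hit from the in-memory published cache.
        return results.Select(r => umbraco.TypedContent(r.Id))
                      .Where(c => c != null)
                      .ToList();
    }
}

The Umbraco.Search(term) wrapper mentioned above collapses both steps into a single call and returns IPublishedContent directly.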
Lastly, if any statement above is wrong or not quite correct, please help me clarify it so that anyone reading this later won't be misled. Thanks, all.

Related

Apache Solr: Merging documents from two sources before indexing

I need to index data from a custom application in Solr. The custom app stores metadata in an Oracle RDBMS and documents (PDF, MS Word, etc.) in a file store. The two are linked in the sense that the metadata in the database refers to a physical document (PDF) in the file store.
I am able to index the metadata from the RDBMS without issues. Now I would like to update the indexed documents with an additional field in which I can store the parsed content from the PDFs.
I have considered and tried the following:
1. Using the Update RequestHandler to try to update the indexed document. This didn't work, and the original document indexed from the RDBMS was overwritten.
2. Using SolrJ to do atomic updates, but I am not sure if this is a good approach for something like this.
Has anyone come across this issue before and what would be the recommended approach?
You can update the document, but it requires that you know the id of the existing document. For example:
{
  "id": "5",
  "parsed_content": {"set": "long text field with parsed content"}
}
Instead of just saying "parsed_content":"something", you have to wrap the value as "parsed_content":{"set":"something"} to make Solr add it to the existing document instead of replacing the whole thing.
See https://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22 for documentation on how to work with multivalued fields etc.
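If you are driving the update from code rather than curl, here is a minimal .NET sketch of POSTing that same atomic update to Solr's JSON update handler; the core name "documents" and the field values are placeholders:

// POST an atomic update; the {"set": ...} wrapper makes it a modification.
using System;
using System.Net.Http;
using System.Text;

public class SolrAtomicUpdate
{
    public static void Run()
    {
        var client = new HttpClient();
        var body = "[{\"id\": \"5\", " +
                   "\"parsed_content\": {\"set\": \"long text field with parsed content\"}}]";
        var content = new StringContent(body, Encoding.UTF8, "application/json");

        // commit=true makes the change visible immediately; when updating many
        // documents, commit once at the end instead.
        var response = client.PostAsync(
            "http://localhost:8983/solr/documents/update?commit=true", content).Result;
        Console.WriteLine(response.StatusCode);
    }
}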

Search in content and skip HTML elements in MVC

I want to search in content without getting false results.
Assume a user searches for 'br': I don't want the output to include results that only match because of <br>, <p>, or other HTML elements.
Simply put, you must strip the tags before you search. However, that would mean not being able to query the database directly; rather, you'd have to pull all the objects first and then query the collection in memory.
If you're going to be doing a lot of this or have large collections of objects (where pulling all of them for the initial query would be a performance drag), then you should look into a true search solution. I've been working with Elasticsearch, which seems to be just about the best out there in my opinion. It's easy to set up, easy to use, and has third-party .NET integration through the NuGet package NEST.
With a true search solution, you can index your content fields, stripped of HTML, and then run your queries on the index instead of directly on your database. You'll also get powerful advanced features such as faceting, which would be difficult or impossible to do directly with Entity Framework.
Alternatively, if you can't go all-in on search and it's unacceptable to query everything up front (which it pretty much always is), then your only other option is to create a companion field for each HTML content property and always save an HTML-stripped copy of the text there. Then use that field for your search queries.
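A minimal sketch of that companion-field idea; the entity and property names (Article, Body, BodySearchText) are hypothetical:

// Keep an HTML-stripped shadow copy of each HTML property for searching.
using System.Text.RegularExpressions;

public class Article
{
    public int Id { get; set; }
    public string Body { get; set; }           // HTML content shown to users
    public string BodySearchText { get; set; } // stripped copy used in queries
}

public static class HtmlStripper
{
    // Naive tag stripper - fine for building search text, not for sanitizing output.
    public static string StripTags(string html)
    {
        return Regex.Replace(html ?? string.Empty, "<[^>]+>", " ").Trim();
    }
}

Refresh the companion field on every save (article.BodySearchText = HtmlStripper.StripTags(article.Body)) and point your search queries at it, e.g. db.Articles.Where(a => a.BodySearchText.Contains(term)).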

Updating Core Data performance

I'm creating an app that uses Core Data to store information from a web server. When there's an internet connection, the app will check whether any entries have changed and update them. Now I'm wondering about the best way to go about it. Each entry in my database has a last-updated timestamp. Which of these two will be more efficient:
Go through all entries and check the timestamp to see which entry needs to be updated.
Delete the whole entity and re-download everything.
Sorry if this seems like an obvious question and thanks!
I'd say option 1 would be the most efficient, as there is rarely a case where downloading everything (especially from a large database with large amounts of data) is more efficient than downloading only the parts you need.
I recently did something similar.
I solved the problem by assigning a unique ID and a global 'updated' timestamp to each entity, and thinking in terms of 'delta' changes.
To explain better: I have a global 'latest update' variable stored in user preferences, with a default value of 01/01/2010.
This is roughly my JSON service:
response: {
  metadata: { latestUpdate: 2013... etc },
  entities: { .... }
}
Then this is what goes on:
pass the 'latest update' to the web service and retrieve the list of changed entities
update the Core Data store
if everything went fine with Core Data, the 'latestUpdate' from the service metadata becomes my new 'latest update' variable stored in user preferences
That's it - I only retrieve the needed changes, and of course the web service is structured to deliver a proper list. In other words, a web service backed by a database can deal with this quite well and leaves the iPhone to be a 'simple client' only.
But I have to say that for small amounts of data, it is still quite performant (and less bug-prone) to download the whole list on each request. A sketch of the delta flow follows below.
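The flow itself is platform-neutral. Here is a compact sketch of it (purely illustrative C# - SyncResponse, fetchChangesSince, and applyToStore are made-up names, not a Core Data API):

// Platform-neutral sketch of the delta-sync loop described above.
using System;
using System.Collections.Generic;

public class SyncResponse
{
    public DateTime LatestUpdate;          // the service's metadata.latestUpdate
    public List<string> ChangedEntities;   // the service's entities list
}

public class DeltaSync
{
    // Persisted between runs (user preferences on iOS); defaults far in the past.
    private DateTime lastUpdate = new DateTime(2010, 1, 1);

    public void Sync(Func<DateTime, SyncResponse> fetchChangesSince,
                     Action<List<string>> applyToStore)
    {
        // 1. Ask the server only for what changed since the last sync.
        SyncResponse response = fetchChangesSince(lastUpdate);

        // 2. Apply the delta to the local store.
        applyToStore(response.ChangedEntities);

        // 3. Advance the watermark only after the store saved successfully.
        lastUpdate = response.LatestUpdate;
    }
}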
As per our discussion in the comments above, you can model your Core Data entities with version control like this:
CoreDataEntityPerson:
name : String
name_version : int
image : BinaryData
image_version : int
You can now model the server xml in the following way:
<person>
<name>michael</name>
<name_version>1</name_version>
<image>string_converted_imageData</image>
<image_version>1</image_version>
</person>
Now you can follow these steps:
When the response first arrives and you parse it, you create a new object from the entity and fill in the data directly.
The next time you perform an update on the server, you increase the version count of the changed entry by 1 and store it.
E.g. let's say the name michael is now changed to abraham; the version count of name_version on the server will then be 2.
This updated version count will come back in the response data.
Now, while storing the data in the same object, if you find the version count to be the same, the update of that entry can be skipped; but if you find the version count has changed, then the update of that entry needs to be done.
This way you can efficiently check each entry and perform updates only on the changed entries (see the sketch after this answer).
Advice:
The above approach works best when you're dealing with a large amount of data to update.
For simple text entries on an object, simply overwriting the data on all entries is efficient enough, and it also keeps the response model simple.
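To make the version check concrete, here is a plain C# sketch (the property names mirror the XML above; nothing here is a Core Data or server API):

// Apply an update, touching only the fields whose version counter moved on.
public class PersonUpdate
{
    public string Name; public int NameVersion;
    public byte[] Image; public int ImageVersion;
}

public class Person
{
    public string Name; public int NameVersion;
    public byte[] Image; public int ImageVersion;

    public void Apply(PersonUpdate update)
    {
        if (update.NameVersion != NameVersion)
        {
            Name = update.Name;
            NameVersion = update.NameVersion;
        }
        if (update.ImageVersion != ImageVersion)
        {
            // Skipping unchanged binary blobs is where this pays off most.
            Image = update.Image;
            ImageVersion = update.ImageVersion;
        }
    }
}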

Ruby on Rails: How to have multiple controllers for one table AND multiple models

I'm new to Ruby and to Rails. I have played a bit with Sinatra, but I think Rails is a more complete framework for my project. However, I am running into trouble with this.
I am working with a fairly substantial existing, and heavily used, MySQL database, and I am trying to build an API for it that will report on certain features. The features needed are, for the most part, counts of records by certain groupings, with the ability to drill down into details.
For example, we have a table - tableA - that contains lots of information relating to documentation. One piece of information we want to report on is the number of items in a given language. The language code is stored against each item, and based on a GET request I would like to return JSON.
Request: /languages/:code/count/:tablename
There are two variables in that URL - the code we are counting and the table we are counting from.
I understand that in routes.rb I can set up a mapping:
get '/languages/:code/count/:table', :controller=>'languages', :action=>'count'
I have a controller - languages_controller.rb - with a count method in it. This then matches a corresponding view file, count.html.erb.
In all the tutorials I have read and the examples I have followed, the main point seems to be that 'languages' would be a table in the database and would therefore be available through the 'magic' Rails approach.
My issue is that it is not a table; rather, the result of the call should be a limited subset of the fields in tableA, such as languagecode and count(id).
The description of the language needs to be looked up 'manually', as it is stored as an internal code that is not in a database anywhere (historic decision/madness).
The questions:
How do I have a model that is only a subset of fields, plus some that are manually populated - languagecode, isocode, description, count?
Am I right in thinking that once I have the model defined as such, I could use ActiveRecord to get data from the database and then add the extra information in the controller?
Can I change the table the model reads from based on the parameter sent in the URL?
Essentially, I am at a loss at the moment on what to do with this. I have the routes defined, the view templates in place, and the controller there and ready to go. The database component - getting some data from a pre-existing table - seems mysterious to me.
Any help is greatly appreciated. It seems the framework is currently getting in my way, and I know I can't be the only one trying this sort of thing, so if you have any advice please share.
There's really no need for a model here at all. This isn't what ORMs are for. What you should be doing is just running raw SQL against the database and iterating over the results. Consider doing something like this: https://stackoverflow.com/a/14840547/229044

Response comments added to the wrong parent document

I have a view data source that uses a view key to access documents and show them in a repeat with var "posts". Within the repeat I have a document data source with var "post" that gets the UNID of the document using posts.getUniversalID().
Further down the repeat I have another document data source, "newcomment", which is a response document and takes its parent id from post.getDocument().getUniversalID().
Below the newcomment data source I have an edit box and a submit button that saves the comment as a response to the "post" using newcomment.save().
Here is my problem:
Two people access the same XPage. PersonA enters the page and starts writing a comment on a post. At the same time, PersonB creates a new post and submits it before PersonA submits the comment. What happens now is that the comment gets bound to the latest post and not to the post PersonA responded to.
I tried another thing as well. Let's say there are 10 posts in the database. PersonA and PersonB access the XPage. PersonA starts writing a comment on post number 8. At the same time, PersonB creates two new posts in the database. When PersonA now submits the comment, it seems to get bound to the same index - still index 8, which is now two posts further up and of course the wrong post.
If I change the repeat to "createControlsAtPageCreation", i.e. repeatControls=true, the comment is attached to the correct post, but then I run into another problem: the view is not updated to show the latest posts.
My repeat is within a custom control that is loaded dynamically using the Dynamic Content control in the Extension Library.
For information, here is what I have found about the repeatControls setting:
"Setting the repeatControls property to true instructs the repeat control to create a new copy of its children for each iteration over the dataset."
"When the Repeat control is configured with the property repeatControls="true", it repeats its contents only once, at page load time."
So my question is that I do not understand what is going on. Why is my comment attached to the wrong parent document? And is there a way I can prevent this and still have new posts displayed correctly?
Thanks for your help.
Without the code it's a bit hard to imagine what exactly is going on here, but this looks very similar to a problem I had with the repeat control and value bindings.
Long story short, the problem was connected to the repeatControls property being set to false. With that setting, data bindings were only working for the first element in the collection - all the data was somehow magically saved to that first object! I managed to get it working by using a combination of a Dynamic Content control rebuild and repeatControls set to true. Only then did the data bindings work properly.
It seems that if you repeat the rendering only (which is what repeatControls set to false does), the decode phase of the JSF lifecycle goes foobar.
Without your XSP markup it's difficult to be absolutely definitive, but it appears that your app code is creating and persisting the datasources and components per row during page load - thereby also increasing the overall size and complexity of the component tree. You should instead try an approach that lazy-loads a datasource only when it is requested by the end user (e.g. edit / reply).
Refer to the XPages Extension Library demo application (XPagesExt.nsf) for examples that use such a dynamic approach. In particular, look at Core_InPlaceForm.xsp, which demonstrates using the xe:inPlaceForm control within an xp:repeat, and Domino_ForumView.xsp, which demonstrates using the xe:forumView and xe:forumPost controls to manage and visualize a hierarchical thread. Also consider the concurrency mode that best suits your requirements when it comes to saving any given post or comment (fail, createConflict, force, exception), and document locking for high-contention situations. The above-mentioned controls all provide the highest level of dynamic control over datasource creation and destruction.
Please feel free to send me a worked example database so I can understand your exact use case - DM me or email me.
