Apache Solr: Merging documents from two sources before indexing

I need to index data from a custom application in Solr. The custom app stores metadata in an Oracle RDBMS and documents (PDF, MS Word, etc.) in a file store. The two are linked in the sense that the metadata in the database refers to a physical document (PDF) in the file store.
I am able to index the metadata from the RDBMS without issues. Now I would like to update the indexed documents with an additional field in which I can store the parsed content from the PDFs.
I have considered and tried the following:
1. Using the Update RequestHandler to update the indexed document. This didn't work: the original document indexed from the RDBMS was overwritten.
2. Using SolrJ to do atomic updates, but I am not sure if this is a good approach for something like this.
Has anyone come across this issue before and what would be the recommended approach?

You can update the document, but it requires that you know the id of the existing document. For example:
{
  "id": "5",
  "parsed_content": {"set": "long text field with parsed content"}
}
Instead of just saying "parsed_content":"something" you have to wrap the value in "parsed_content":{"set":"something"} to trigger adding it to the existing document.
See https://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22 for documentation on how to work with multivalued fields etc.
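As a sketch, the atomic-update payload above can be built programmatically before posting it to Solr's JSON update endpoint. A minimal Python helper (the core URL in the comment is an assumption; adjust it to your setup):

```python
import json

def make_atomic_update(doc_id, field, value):
    """Build a Solr atomic-update payload that sets `field` on the
    existing document with the given id, instead of replacing the
    whole document. The {"set": ...} wrapper is what makes Solr
    treat this as an update rather than a full re-add."""
    return [{"id": doc_id, field: {"set": value}}]

payload = json.dumps(make_atomic_update("5", "parsed_content", "parsed PDF text"))
# POST `payload` to http://localhost:8983/solr/<core>/update with
# Content-Type: application/json, then commit.
```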

Related

Getting more than 1000 documents using folder() in Appian

I am writing an Appian web API, to retrieve documents from our Appian system which will be used to integrate with our other systems.
To this end, I am using the folder() method to get information about the contents of a folder in Appian.
folder(
theCaseFolder,
"documentChildren"
)
The problem I am having is that while this code works most of the time - we have some cases where there are more than 1000 documents stored against the case. I note that the Appian documentation states that:
The documentChildren and folderChildren properties return up to the first 1000 documents or folders, respectively, that are direct children of the selected folder.
My problem is that we have a few cases where there are more than 3000 documents attached to the case. Is there a way to get a list of those child documents, or am I plain out of luck?
In the long term I would suggest storing some information about the documents in a separate database table. That way you can query the database however you wish, from Appian or via SQL.
In the short term you can get the first 1000 documents as described in the documentation, then move them to a subfolder/different folder or delete them. Repeat this until you have retrieved all the files from the folder.
Move Document Appian Function

Differences in Umbraco cache structure?

OK, so I have just spent the last 6-8 weeks in the weeds of Umbraco and have made some fixes/improvements to our site and environments. I spent a lot of that time trying to correct lower-level Umbraco caching issues. Reflecting on the experience, I still don't have a clue what the conceptual differences are between the following:
Examine indexes
umbraco.config
cached xml file in memory (supposedly similar to umbraco.config)
CMSContentXML Table
Thanks Again,
Devin
Examine indexes are indexes of Umbraco content.
Whenever you create/update/delete content, the current content information is indexed.
These indexes are used for searching - under the hood, they are Lucene indexes.
The Umbraco back office uses these indexes for searching.
You can create your own index if you want.
For more info, check out Overview & Explanation - "Examining Examine" by Peter Gregory.
umbraco.config and the cached XML in memory are really the same thing.
The front-end UmbracoHelper API gets content from the cache, not the database - and the cache is loaded from umbraco.config.
CMSContentXML contains each content item's information as XML,
so essentially this XML represents all the information of a content node.
So in a nutshell they represent three things:
Examine is used for searching.
umbraco.config caches data - saving a round trip to the DB.
CMSContentXML stores the full information of a content item.
Edit: to include better clarification from Robert Foster's comments, and on UmbracoHelper vs ExamineManager.
For umbraco.config and the CMSContentXML table, @robert-foster commented:
umbraco.config stores the most recent version of all published content only; the in-memory cache is a cached version of this file; and the cmscontentxml table stores a representation of all content and is used primarily for preview mode - it is updated every time a content item is saved. IIRC it also stores a representation of other content types
Regarding UmbracoHelper vs ExamineManager:
The UmbracoHelper API mostly gets its content from the memory cache - IMO it works best when locating content directly, such as when you know the id of the content you want: you just call Umbraco.TypedContent(id).
But where do you get that id in the first place? Put another way, if you want to find all content whose Title property contains the word "Test", you would use Examine to search for it. Because Examine is really a Lucene wrapper, it is fast and efficient.
Although you can traverse the tree with methods such as Umbraco.TypedContent(id).Children and then use LINQ to filter the result, I think this is done in memory using LINQ-to-Objects, so it is not as efficient and performant as Lucene.
So personally I think:
Use Examine when you are searching for (locating) content - you get the capability of a proper search engine, Lucene.
Once you have the ids from the search result, use UmbracoHelper to get the full published content for each id as a strongly-typed model and work with the data.
One thing @robert-foster mentioned in the comments which I did not know is that UmbracoHelper provides a Search method which is a wrapper around Examine, so use that if you are more familiar with that API.
Lastly, if any statement above is wrong or not quite correct, help me clarify it so that anyone looking at this later will not get it wrong. Thanks all.

Updating core data performance

I'm creating an app that uses Core Data to store information from a web server. When there's an internet connection, the app will check if there are any changes to the entries and update them. I'm wondering about the best way to go about this. Each entry in my database has a last-updated timestamp. Which of these two approaches will be more efficient:
Go through all entries and check the timestamp to see which entry needs to be updated.
Delete the whole entity and re-download everything again.
Sorry if this seems like an obvious question and thanks!
I'd say option 1 would be most efficient, as there is rarely a case where downloading everything (especially in a large database with large amounts of data) is more efficient than only downloading the parts that you need.
I recently did something similar.
I solved the problem by assigning a unique ID and a global 'updated' timestamp, and by thinking in terms of 'delta' changes.
To explain better: I have a global 'latest update' variable stored in user preferences, with a default value of 01/01/2010.
This is roughly my JSON service:
response: {
  metadata: {latestUpdate: 2013..., etc.},
  entities: {....}
}
Then, this is what's going on:
pass the 'latest update' to the web service and retrieve a list of entities
update the Core Data store
if everything went fine with Core Data, the 'latestUpdate' from the service metadata becomes the new 'latest update' variable stored in user preferences
That's it. I am only retrieving the needed changes, and of course the web service is structured to deliver a proper list. In other words: a web service backed by a database can deal with this quite well, leaving the iPhone to be a 'simple client' only.
But I have to say that for small amounts of data, it is still quite performant (and less bug-prone) to download the whole list on each request.
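The sync flow above can be sketched language-agnostically; here is a minimal Python illustration of the client side, where `fetch_changes` stands in for the (hypothetical) web-service call and `prefs` for the user-preferences store:

```python
DEFAULT_LATEST_UPDATE = "2010-01-01T00:00:00"

def apply_delta(local_store, prefs, fetch_changes):
    """Sketch of the delta-sync flow: send the last high-water mark,
    apply only the changed entities, then persist the new mark.
    `fetch_changes(since)` must return a dict shaped like
    {"metadata": {"latestUpdate": ...}, "entities": {...}}."""
    since = prefs.get("latest_update", DEFAULT_LATEST_UPDATE)
    response = fetch_changes(since)
    # merge only the changed entities into the local store
    local_store.update(response["entities"])
    # advance the high-water mark only after the store update succeeded
    prefs["latest_update"] = response["metadata"]["latestUpdate"]
    return local_store
```

The key design point is that the high-water mark is written last, so a failed store update simply means the same delta is fetched again next time.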
As per our discussion in the comments above, you can model your Core Data entities with version control like this:
CoreDataEntityPerson:
name : String
name_version : int
image : BinaryData
image_version : int
You can now model the server xml in the following way:
<person>
<name>michael</name>
<name_version>1</name_version>
<image>string_converted_imageData</image>
<image_version>1</image_version>
</person>
Now, follow these steps:
When the response arrives and you parse it, you initially create a new object from the entity and fill in the data directly.
The next time an entry is updated on the server, its version count is increased by 1 and stored.
E.g. let's say the name michael is changed to abraham; the version count of name_version on the server will then be 2.
This updated version count will come in the response data.
Now, while storing the data in the same object, if you find the version count to be the same, the update of that entry can be skipped; but if you find the version count has changed, the entry needs to be updated.
This way you can efficiently check each entry and update only the changed entries.
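The version-count comparison above can be sketched as a small merge function. This is an illustration of the idea (field names match the hypothetical model above), not Core Data API code:

```python
def merge_versioned(local, incoming, fields=("name", "image")):
    """Update a field only when the server's version counter is
    newer than the local one; otherwise skip the write entirely."""
    for field in fields:
        version_key = field + "_version"
        if incoming[version_key] > local.get(version_key, 0):
            local[field] = incoming[field]
            local[version_key] = incoming[version_key]
    return local
```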
Advice:
The above approach works best when you're dealing with a large amount of data to update.
For simple text entries on an object, simply overwriting the data on all entries is efficient enough, and it also keeps the response model simple.

Adding custom attributes to Task?

How can I add custom attributes/data to a Task via the API? For example, we wanted to add fields like customer contact number or deal amount, etc.
We don't currently support adding arbitrary metadata to tasks, though it's something we're thinking about. In the meantime, what many customers do is simply put data in the note field in an easily parseable form, which works well and also lets humans reading the task see, e.g., the ticket number.
It's not a terribly elegant solution, but it works.
https://asana.com/developers/documentation/getting-started/custom-external_data
Custom external data allows a client application to add app-specific metadata to Tasks in the API. The custom data includes a string id that can be used to retrieve objects and a data blob that can store character strings.
See the external field at https://asana.com/developers/api-reference/tasks
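As a sketch, a client could attach its metadata through the external field when updating a task. The helper below only builds the request body; the endpoint path in the comment is an assumption, so verify field names and routes against the API reference linked above:

```python
import json

def external_data_payload(external_id, blob):
    """Build a task-update body carrying app-specific metadata in the
    `external` field: a string id plus a data blob serialized as a
    character string, per the custom external data docs."""
    return {"data": {"external": {"id": external_id, "data": json.dumps(blob)}}}

body = external_data_payload("crm-4711", {"customer_phone": "+1-555-0100",
                                          "deal_amount": 25000})
# PUT this body (as JSON) to the task update endpoint, e.g.
# /api/1.0/tasks/<task-id>, authenticated as your app.
```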

List of all movie title, actors, directors, writers on Imdb

I am working on a web app which lets users list their favourite movies, directors, movie writers, and actors. For this I want to provide a dropdown list or autocomplete for each of them so that they can just pick their choices.
For this:
I need a list of all movie titles, actors, directors, writers present on Imdb.
I checked Imdbpy and it does not seem to provide methods to get this data.
Would using imdbpy2sql.py to create a database and using sql to query the db, provide the required data? Is there any other way to do this?
Thanks!
Using imdbpy2sql.py to create a database and querying it with SQL will provide the required data.
You can also try using Java Movie Database or imdbdumpimport to read in the text files to SQL.
The last option to do this is parsing the plain text files provided by IMDb yourself.
I think your best option is to parse the plain text files distributed here: imdb interfaces.
You probably just need the 'movies', 'actors', 'actresses' and 'directors' files; they are quite easy to parse.
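As a starting point for the parsing approach, here is a hedged Python sketch. It assumes a simplified line format of `Title (Year)`; the real IMDb list files have a richer format (TV episodes, numbering suffixes, quoted series titles), so treat the regex as illustrative only:

```python
import re

# Assumed simplified format: a title followed by a 4-digit year in
# parentheses, e.g. 'The Matrix (1999)'. Real list files need more rules.
TITLE_RE = re.compile(r"^(?P<title>.+?)\s+\((?P<year>\d{4})\)")

def parse_titles(lines):
    """Extract (title, year) pairs from plain-text lines,
    silently skipping lines that do not match the pattern."""
    results = []
    for line in lines:
        match = TITLE_RE.match(line.strip())
        if match:
            results.append((match.group("title"), int(match.group("year"))))
    return results
```

The extracted pairs could then be bulk-loaded into whatever store backs the autocomplete.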