Index if not exists using bulk processor in elasticsearch - twitter

I am trying to index a document only if it doesn't already exist in Elasticsearch. I am indexing my documents with a BulkProcessor, using the Requests.add action. Sometimes I will have the exact same id; in that case, does it add a new document, or does it update the existing one?
P.S. Updating is not a requirement; the document can stay as is.
P.S.2 I am trying to integrate a user's past tweets into elasticsearch-twitter-river's user stream.

If you index a doc with the same document id then it will do an update. Otherwise it will add a new document.
In other words, if you PUT a doc to {index}/{type}/{id}, then it will always update (overwrite) the document with that id. If you POST a doc to {index}/{type} then in general Elasticsearch will generate a new document for each of your POSTs. That is, unless you mapped a document field to the _id field in mappings.
It seems that the Twitter River uses the PUT method with an explicitly specified id, so tweets with the same id will probably be overwritten.
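If you want strict "index if not exists" behaviour from the BulkProcessor, you can set the op type of each request to create, so a bulk item fails instead of overwriting when the id is already taken. A minimal sketch against the older Java client of the river era (the index/type names and the BulkProcessor wiring are assumptions for illustration):

import org.elasticsearch.action.bulk.BulkProcessor;
import org.elasticsearch.action.index.IndexRequest;

public class TweetIndexer {
    private final BulkProcessor bulkProcessor;

    public TweetIndexer(BulkProcessor bulkProcessor) {
        this.bulkProcessor = bulkProcessor;
    }

    public void indexIfAbsent(String tweetId, String tweetJson) {
        IndexRequest request = new IndexRequest("twitter", "status", tweetId)
                .source(tweetJson)  // the tweet as a JSON string
                .create(true);      // op_type=create: reject duplicates instead of updating
        bulkProcessor.add(request); // duplicates surface as per-item failures in the BulkResponse
    }
}

With create, a duplicate id shows up as a per-item failure in the bulk response rather than a silent update, which matches your "it can stay as is" requirement.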

Related

Is there a way to Bulk extract contact details out of Oracle Eloqua's API?

I am trying to extract a large amount of details out of our Eloqua system using its API, and I got this endpoint to work perfectly for single IDs: https://docs.oracle.com/en/cloud/saas/marketing/eloqua-rest-api/op-api-rest-1.0-data-contact-id-get.html
The problem is that I need to run this for a large number of IDs, and it would take a lot of calls to cover the entire population. Are there any bulk APIs that can extract all of the following details out of Eloqua/Contact for the entire population? I don't see any under the Bulk section of that page's documentation that meet this need.
contactid, company, employees, company_revenue, business_phone, email_address, web_domain, date_created, date_modified, address_1, address_2, city, state_or_province, zip_or_postal_code, mobile_phone, first_name, last_name, title
It's a multi-step process with the Bulk API, typically in the following fashion (a code sketch follows the list):
Get a list of the current internal field names - useful for creating your export definition
Create an export definition and POST it to the contact exports endpoint (shown in the sketch below). There is a useful example on the page; you do not need filter criteria. Store the export ID somewhere.
Using your export definition id, create a sync. It will gather the data in the background and prepare it for you. Take note of the sync ID provided in the initial response.
Check on the sync status with your sync ID. It should only take a couple of minutes, and there is a callback URL option in the previous step as well, if you don't want to keep polling.
Once your data is ready, use that sync id and request the data. Depending on how many rows were retrieved, you might need to paginate through the results using the offset query param. By default it will give you JSON, but I usually choose CSV (specify in the header).
If you need updated data, feel free to create a new sync using the same export definition id. You do not need to create a new export definition each time.
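Below is a rough sketch of that flow using Java's built-in HTTP client. The base URL, credentials, field statements, and the hard-coded export/sync IDs are placeholders, and the endpoint paths are taken from the Bulk API 2.0 documentation as I understand it, so verify them against your instance:

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class EloquaBulkExport {
    static final String BASE = "https://secure.p01.eloqua.com/api/bulk/2.0"; // assumption: your pod's base URL
    static final String AUTH = "Basic ...";  // base64 of site\user:password
    static final HttpClient client = HttpClient.newHttpClient();

    static String post(String path, String body) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Authorization", AUTH)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    static String get(String path, String accept) throws Exception {
        HttpRequest req = HttpRequest.newBuilder(URI.create(BASE + path))
                .header("Authorization", AUTH)
                .header("Accept", accept) // e.g. "text/csv" instead of the default JSON
                .build();
        return client.send(req, HttpResponse.BodyHandlers.ofString()).body();
    }

    public static void main(String[] args) throws Exception {
        // Step 2: create the export definition once; fields use the
        // {{Contact.Field(...)}} statements discovered in step 1.
        String export = post("/contacts/exports",
                "{\"name\":\"All contacts\",\"fields\":{"
              + "\"contactId\":\"{{Contact.Id}}\","
              + "\"emailAddress\":\"{{Contact.Field(C_EmailAddress)}}\"}}");
        // ...parse the export id/uri out of the JSON response...

        // Step 3: kick off a sync for that export definition (12345 = your export id).
        String sync = post("/syncs", "{\"syncedInstanceUri\":\"/contacts/exports/12345\"}");

        // Step 4: poll GET /syncs/{id} until "status" is "success" (or use the callback).
        // Step 5: page through the rows with offset/limit (67890 = your sync id).
        String rows = get("/syncs/67890/data?offset=0&limit=1000", "text/csv");
        System.out.println(rows);
    }
}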

Apache Solr: Merging documents from two sources before indexing

I need to index data from a custom application in Solr. The custom app stores metadata in an Oracle RDBMS and documents (PDF, MS Word, etc.) in a file store. The two are linked in the sense that the metadata in the database refers to a physical document (PDF) in the file store.
I am able to index the metadata from the RDBMS without issues. Now I would like to update the indexed documents with an additional field in which I can store the parsed content from the PDFs.
I have considered and tried the following:
1. Using the Update RequestHandler to try to update the indexed document with the parsed content. This didn't work, and the original document indexed from the RDBMS was overwritten.
2. Using SolrJ to do atomic updates, but I am not sure if this is a good approach for something like this.
Has anyone come across this issue before and what would be the recommended approach?
You can update the document, but it requires that you know the id of the existing document. For example:
{
"id": "5",
"parsed_content":{"set": "long text field with parsed content"}
}
Instead of just saying "parsed_content":"something" you have to wrap the value in "parsed_content":{"set":"something"} to trigger adding it to the existing document.
See https://wiki.apache.org/solr/UpdateXmlMessages#Optional_attributes_for_.22field.22 for documentation on how to work with multivalued fields etc.
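Since SolrJ was mentioned: the same atomic update can be expressed by wrapping the value in a Map with a "set" key. A minimal sketch (the core URL and the parsed_content field name are assumptions for illustration):

import java.util.Collections;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class AtomicUpdateExample {
    public static void main(String[] args) throws Exception {
        HttpSolrClient solr =
                new HttpSolrClient.Builder("http://localhost:8983/solr/mycore").build();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "5");  // id of the already-indexed metadata document
        // Passing a Map with a "set" key turns this into an atomic update
        // instead of a full-document overwrite.
        doc.addField("parsed_content",
                Collections.singletonMap("set", "long text field with parsed content"));

        solr.add(doc);
        solr.commit();
        solr.close();
    }
}

Note that atomic updates require the other fields in your schema to be stored (or docValues-backed), because Solr rebuilds the whole document internally; if they are not stored, their values are lost, which looks exactly like the overwrite you saw with approach 1.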

Difference between vcard and vcard_search table in ejabberd

What is the difference between the vcard and vcard_search tables in ejabberd? I mean to say, for what purpose is each of them used?
The vcard table is used to store the raw vCard. vcard_search is an index table, used for searching user vCards by field. It is, for example, used in user directory queries to find the list of matching users.
You can read more details about each field here: http://docs.ejabberd.im/developer/sql-schema/

Twitter Search API IDs meaning

I am using the Twitter Search API and I can't understand the id field of a tweet.
For example, here is one: <id>tag:search.twitter.com,2005:1990561514</id>. The real ID is the final number part, right? Why doesn't Twitter already provide this in a single element? And why is there a year, 2005, in the ID field? Is the counter tied to that year, with tweets in following years getting IDs that restart from zero? Is the ID indexed to the year?
I am asking all this because I am going to use the since_id option to retrieve new tweets. If the ID isn't really unique and depends on the year, it won't work as expected.
Thanks.
The tag is unique - but parts of it are redundant.
tag:search.twitter.com,2005:1990561514
Obviously, search.twitter.com is the host from which you requested the document.
The ,2005 is constant. As far as I can tell, it has never changed since the service was launched. While there's no official documentation, I would guess that it refers to the Atom specification namespace: http://www.w3.org/2005/Atom
Finally, the long number is the Tweet's status ID. It will always be unique and can be used for the since_id.
What you will need to do is split the string, and just use the number after the colon as your ID.
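For instance, a minimal sketch of that split in Java:

public class TweetIdParser {
    public static void main(String[] args) {
        // The atom <id> value from the question.
        String tag = "tag:search.twitter.com,2005:1990561514";
        // Everything after the last colon is the numeric status ID.
        long statusId = Long.parseLong(tag.substring(tag.lastIndexOf(':') + 1));
        System.out.println(statusId); // 1990561514, usable as since_id
    }
}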
I believe you are doing something wrong. If you look at all of the example results from the Twitter Search API, none of the id fields are formatted like this one you are showing.
For example:
http://search.twitter.com/search.json?q=%40twitterapi%20-via
Also, if you check out the example requests page, you will see that all of the id fields have normal formats, i.e.:
"id":122032448266698752
Update:
Now that I know you are using the atom feed, I can see where the seemingly oddly formatted element comes from. See this article on avoiding duplicates in atom feeds. Another helpful article.
Basically, atom feeds REQUIRE a unique id for each element in a feed. Some feeds use the "tag" scheme to ensure uniqueness. This format is actually pretty common in atom feeds and many frameworks use it by default. For instance, the RoR AtomFeedHelper (which might even be what Twitter uses) specifies the default format to be:
"tag:#{request.host},#{options}:#{request.fullpath.split(".")}"

Multiple File Uploads for a new record

I have implemented multiple file uploads for existing records as seen here https://github.com/websymphony/Rails3-Paperclip-Uploadify
I would like to know if it is possible to implement multiple file uploads when creating a new record.
Since we need flash to do the multiple file uploads, how would it be possible to associate the uploaded files with the record if the record has not yet been created.
I have thought of a hack-ish way to essentially make a "draft" and update it. However, I hope there is a better way to do this.
There is no better way than the kind of hackish approaches you present:
creating orphan objects and giving them parents later, or deleting them (sad, huh? :) )
creating the parent object by default, adding a confirmation field in the form so that you know which objects really have an owner, and deleting the rest.
BTW, you don't "need" flash for multiple uploads, see here: http://blueimp.github.com/jQuery-File-Upload/
Yes, you can use http://blueimp.github.com/jQuery-File-Upload/. But there are still some points you need to be careful about.
Don't forget to remove the appended array after you define the file field with "multiple". For example, replace "photo[image][]" with "photo[image]". Otherwise file uploaders like CarrierWave will not work.
If you are using Rails 3.2.13, the appended array will always appear no matter whether you set the name without the appended array. You can use "file_field_tag" to resolve this problem. Please refer to this issue: https://github.com/tors/jquery-fileupload-rails/issues/36.
For the association:
You need to create a hidden text field which contains the IDs of the images that will be associated with the record you are going to create.
Upload the images with "jquery-fileupload" (it is Ajax-based), then get the IDs of the images from the response bodies.
Set these IDs in the hidden field.
