ElasticSearch: Specify which shard to store data on - geolocation

I want to store my data on different shards based on some field value; geo-sharding, for example. All my records with continent value 'NA' should go to shard-1 (North America), those with 'EU' to shard-2 (Europe), and so on.
Is there a way I can specify which shard the record (document) should go to?
I tried searching for this, but only found general literature about shards. Any information on this will be helpful!

You can influence data distribution with the routing parameter. In your case, using the continent name as the routing key will group the documents for a given continent on the same shard. However, you won't be able to choose directly which shard a document is stored in.
Here is the definitive guide section about it, and the index API documentation concerning routing.
Be aware that this can end up with some shards/nodes being used much more heavily than others.
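For illustration, here is a minimal Python sketch of the routing approach against the plain REST API (the routing query parameter is standard; the index name, field names, and host are made up):

import requests

doc = {"name": "some record", "continent": "NA"}

# Documents indexed with the same routing value land on the same shard.
requests.put(
    "http://localhost:9200/myindex/_doc/1",
    params={"routing": doc["continent"]},
    json=doc,
)

# Pass the same routing value at search time to hit only that shard.
requests.get(
    "http://localhost:9200/myindex/_search",
    params={"routing": "NA"},
    json={"query": {"term": {"continent": "NA"}}},
)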

Related

Riak: search by key prefix

I'm a newcomer to Riak and I've been reading this chapter from Riak's docs. It shows that by adding structure information to buckets and keys, one can overcome some of the limitations of key/value operations.
Though the article gives an example of how such a key would be structured:
sensor data keys could be prefaced by sensor_ or temp_sensor1_ followed by a timestamp
(e.g. sensor1_2013-11-05T08:15:30-05:00)
no method is mentioned on how to query the data by key prefix (e.g. sensor1_). Looking around Stack Overflow I found this question. In it, MapReduce and key filtering are mentioned as possible solutions. But the documentation on key filters states that they are a soon-to-be-deprecated feature. I also checked out Riak Search as a possible way, but wasn't able to find a way to query data by key prefix.
My question is: What is the best way to search data by key prefix? I would greatly appreciate an example.
The best way to search for a key prefix is to not do it if you don't need to, i.e. design around that search pattern if you can. The primary way to do that is to use deterministic keys that your application can easily compute. That said, if you cannot avoid building your application to require searching on key prefixes, there are a couple of things you can do (all of which have their drawbacks).
Key Filters - http://docs.basho.com/riak/latest/dev/references/keyfilters/ - as you noted already these are marked as deprecated and not recommended at this point.
MapReduce - http://docs.basho.com/riak/latest/dev/advanced/mapreduce/ - a good option if you can query in batches but not really suited for real time querying. You could cache the query results if precomputing the queries is helpful.
Riak Search 2.0 (Solr) - http://docs.basho.com/riak/latest/dev/using/search/ - this is probably the easiest method to implement from an application perspective and allows you to query your keys with a query along the lines of: curl "$RIAK_HOST/search/sensor?wt=json&q=_yz_rk:sensor1_*". Using search does come with a performance hit over straight key-based queries, but you can cache query results.
Data Modeling - querying by key directly will always provide the best performance, as mentioned above. One option is to take advantage of Riak's Data Types (CRDTs) and create a bucket that uses sets. You could create a set for each sensor containing the list of keys associated with that sensor in the first bucket. Then you can iterate over the keys in the set and do a multi-get to return all of the associated records, as sketched below.
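A rough sketch of that last option with the official Python client (assuming a set bucket type named "sets" has been activated; bucket names are made up):

from riak import RiakClient

client = RiakClient()
records = client.bucket("sensor_data")                   # the records themselves
keys = client.bucket_type("sets").bucket("sensor_keys")  # one set of keys per sensor

# On write: store the record, then add its key to the sensor's set.
key = "sensor1_2013-11-05T08:15:30-05:00"
records.new(key, data={"temp": 21.5}).store()
s = keys.new("sensor1")
s.add(key)
s.store()

# On read: fetch the set, then multi-get all the associated records.
s = keys.get("sensor1")
for obj in records.multiget(list(s.value)):
    print(obj.key, obj.data)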
Hope this gives you some ideas.

Mongodb: Is it a good idea to create a unique index on web URLs?

My document looks like:
{"url": "http://some-random-url.com/path/to/article"
"likes": 10
}
The url needs to be unique. Is it a good idea to have a unique index on the url? The URL can be long, resulting in a larger index, a larger memory footprint, and slower overall performance. Would it be better to generate a hash from the url (I am thinking about using murmur3) and create a unique index on that instead? I am assuming that the chances of collision are pretty low, as described here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Does anyone see any drawbacks to this approach? The new document will look like (with a unique index on u_hash instead of url):
{"url": "http://some-random-url.com/path/to/article"
"likes": 10
"u_hash": "<murmur3 hash of url>"
}
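For concreteness, here is the kind of hypothetical helper I have in mind for computing u_hash (using the mmh3 package; the 128-bit variant keeps collisions unlikely):

import mmh3

def u_hash(url):
    # 128-bit murmur3, hex-encoded so it stores as a short fixed-length string
    return format(mmh3.hash128(url), "032x")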
UPDATE
I will not be doing regex queries on the url, only complete-URL lookups. I am more concerned about the performance of this lookup, as I believe it will also be used internally by MongoDB to maintain the unique index, hence affecting write performance as well (+ a longer index). Additionally, my understanding is that MongoDB doesn't perform well with long text indexes, as it wasn't designed for that purpose. I may be wrong though, and it could simply depend on whether or not the index fits into RAM. Any pointers?
I'd like to expand on the answer of @AlexRyan. While he is right in general, there are some things which need to be taken into consideration for this use case.
First of all, we have to differentiate between a unique index and the _id field.
When the URL needs to be unique in your use case, there has to be a unique index. What we have to decide is whether to use the URL itself or a hashed value of it. The hashing itself would not help with the search, as the hash sum saved in a field would be treated as a string by MongoDB. It may save space (URLs are often longer than their hash value), thereby reducing the memory needed for the index. However, doing so takes away the possibility of searching for parts of the URL in the index, for example with
db.collection.find({url:{$regex:/stackoverflow/}})
With a unique index on url, this query would use the index and be quite fast. Without such an index, this query results in a comparatively slow collection scan.
Plus, creating the hash each and every time before querying, updating or inserting doesn't make these operations faster.
This leaves us with the fact that creating a hash sum and a unique index on it may save some RAM at the cost of making queries on the actual field slower by orders of magnitude. And it introduces the need to create a hash sum each and every time. Having an index on both the URL and its hashed value would not make sense at all.
Now to the question of whether it is a good idea to use the URL as _id one way or the other. Since URLs are usually distinct by nature (a given URL is supposed to always return the same content) and the likes are tied to that uniqueness, I would tend to use the URL as the id. Since you need the unique index on _id anyway, it serves two purposes here: you have your id for the document, you ensure uniqueness of the URL, and - in case you use the natural representation of the URL - it will even be queryable in an efficient way.
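A quick sketch of that approach with pymongo (database and collection names are made up):

from pymongo import MongoClient
from pymongo.errors import DuplicateKeyError

coll = MongoClient()["mydb"]["articles"]
url = "http://some-random-url.com/path/to/article"

try:
    # _id is always backed by a unique index, so this doubles as the uniqueness check.
    coll.insert_one({"_id": url, "likes": 10})
except DuplicateKeyError:
    pass  # the URL is already stored

doc = coll.find_one({"_id": url})  # a complete-URL lookup is a plain point query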
Use a unique index on url
db.interwebs.ensureIndex({ "url" : 1}, { "unique" : 1 })
and not a hashed index. Hashed indexes in MongoDB are meant to be used for hashed shard keys and not for unique constraints. From the hashed index docs,
Hashed indexes support sharding a collection using a hashed shard key. Using a hashed shard key to shard a collection ensures a more even distribution of data.
and
You may not create compound indexes that have hashed index fields or specify a unique constraint on a hashed index
If url needs to be unique and you will use it to look up documents, it's absolutely worth having a unique index on url. If you want to use url as the primary key for documents, you can store the url value in the _id field. This field is normally a driver-generated ObjectId but it can be any value you like. There's always a unique index on _id in a MongoDB collection so you get the unique index "for free".
I think the answer is "it depends".
Choosing keys that have no real-world meaning embedded in them may save you pain in the future. This is especially true if you decide you need to change the key but have a lot of foreign keys referencing it.
Most database management systems offer you a way to generate unique IDs.
In Oracle, you might use a sequence.
In MySQL you might use AUTO_INCREMENT when you define the table itself.
The way that MongoDB assigns unique ids to documents is different than in relational databases: it uses ObjectIds for this purpose.
One of the interesting things about ObjectIDs is that they are generated by the driver.
Because of the algorithm that is used to generate them, they are guaranteed to be unique even if you have a large cluster of app and database servers.
You can learn more about them here:
http://docs.mongodb.org/manual/reference/object-id/
A lot of engineering work has gone into ensuring that ObjectIds are unique.
I use them by default unless there is a really good reason not to.
So far, I have not found a really good reason to not use them.
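A quick illustration of driver-generated ObjectIds, using pymongo's bson package:

from bson import ObjectId

oid = ObjectId()
print(oid)                  # 24 hex characters encoding 12 bytes
print(oid.generation_time)  # the creation timestamp is embedded in the id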

Parsing Wikipedia countries, regions, cities

Is it possible to get a list of all Wikipedia countries, regions and cities with relations between them? I couldn't find any API appropriate for this task.
What would be the easiest way to parse all the information I need?
PS: I know that there are other data sources I could get this information from. But I am interested in Wikipedia...
[2020 update] This is now best done using the Wikidata Query Service: you can run super specific queries with a bit of SPARQL, for example: find all countries and their labels. See Wikidata Query Help.
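For example, a minimal Python sketch against the SPARQL endpoint, listing everything that is an instance of (P31) country (Q6256) along with its English label:

import requests

query = """
SELECT ?country ?countryLabel WHERE {
  ?country wdt:P31 wd:Q6256 .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

resp = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": query, "format": "json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["countryLabel"]["value"])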
It might be a bit tedious to get the whole graph but you can get most of the data from the experimental/non-official Wikidata Query API.
I suggest the following workflow:
Go to an instance of the kind of entities you want to work with, say Estonia (Q191), and look for its instance of (P31) properties; you will find: country, sovereign state, member of the UN, member of the EU, etc.
Use the Wikidata Query API claim command to output every entity that has the chosen P31 property. Let's try with country (Q6256):
http://wdq.wmflabs.org/api?q=claim[31:6256]
It outputs an array of numeric ids: those are your countries! (Notice that the result is still incomplete, as only 141 items are found: either countries are missing from Wikidata or, as suggested by Nemo in the comments, some countries are to be found in country (Q6256) subclasses (P279).)
You may want more than ids, though, so you can ask the official Wikidata API for entity data:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16&format=json&props=labels|claims&languages=en|fr
(here Canada (Q16) data, in JSON, with only claims and labels, in English and French. Look at the documentation to adapt the parameters to your needs)
You can query multiple entities at a time, with a limit of 50, as follows:
https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q16|Q17|Q20|Q27|Q28|Q29|Q30|Q31|Q32|Q33|Q34|Q35|Q36|Q37|Q38|Q39|Q40|Q41|Q43|Q45|Q77|Q79|Q96|Q114&format=json&props=labels|claims&languages=en|fr
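A small Python sketch of that batching (the ids here are just the few from the previous step):

import requests

def get_entities(ids):  # at most 50 ids per call
    resp = requests.get(
        "https://www.wikidata.org/w/api.php",
        params={
            "action": "wbgetentities",
            "ids": "|".join(ids),
            "format": "json",
            "props": "labels|claims",
            "languages": "en|fr",
        },
    )
    return resp.json()["entities"]

for qid, entity in get_entities(["Q16", "Q17", "Q20"]).items():
    print(qid, entity["labels"]["en"]["value"])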
From every country's data, you could look for entities registered as administrative subdivisions (P150) and repeat on those new entities.
Alternatively, you can get the whole tree of administrative subdivisions with the tree command. For instance, for France (Q142) that would be http://wdq.wmflabs.org/api?q=tree[142][150] Tadaaa, 36994 items! But that's way harder to refine given the different kinds of subdivision you can encounter from one country to another. And avoid doing this kind of query from a browser; it might crash.
You now just have to find cities by country by refining this last query with the claim command and the appropriate subclass (P279) of the municipality (Q15284) entity (all available here): for France, that's commune (Q484170), so your request looks like
http://wdq.wmflabs.org/api?q=tree[142][150] AND claim[31:484170]
then repeat for all the countries: have fun!
You should go with Wikidata and/or DBpedia.
Personally I'd start with Wikidata, as it directly uses MediaWiki, with the same API, so you can use similar code. I would use pywikibot to get started. That way you can still request pages from Wikipedia where it makes sense (e.g. list pages or categories).
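A possible pywikibot starting point (just a sketch; here fetching France (Q142) and printing its administrative subdivisions (P150)):

import pywikibot

site = pywikibot.Site("wikidata", "wikidata")
repo = site.data_repository()

item = pywikibot.ItemPage(repo, "Q142")  # France
data = item.get()                        # labels, claims, sitelinks, ...
for claim in data["claims"].get("P150", []):  # contains administrative territorial entity
    print(claim.getTarget().getID())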
Here's a nice overview of ways to access Wikidata

Solr Join - getting data from different index

I'm working on a project where we have 2 million products and 50 clients with different pricing schemes. Indexing 2M x 50 records is not an option at the moment. I have looked at Solr joins and cannot get them to work the way I want. I know it's like a self join, so I'm kinda skeptical it would work, but here it is anyway.
here is the sample schema
core0 - product
core1 - client
So given a client id, I want to display all bags manufactured by Samsonite sorted by lowest price.
If there's a better way of approaching this, I'm open to redesigning the existing schema.
Thank you in advance.
Solr is not a relational database. You should take a look at the sharding feature and split your indexes. Also, you could write custom plugins to compute the price data based on the client's id/name/whatever at index time (BAD: you'll still get a product replicated for each client).
How we do it (so you can get an example):
clients are handled by sqlite
products are stored in solr with their "base" price
each client has a "pricing rule" applied via custom query handler when they interrogate the db (it's just a value modifier)
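An illustrative Python sketch of that setup (the sqlite schema, Solr field names, and the simple multiplier rule are all made up):

import sqlite3
import requests

def client_prices(client_id):
    db = sqlite3.connect("clients.db")
    (multiplier,) = db.execute(
        "SELECT price_multiplier FROM clients WHERE id = ?", (client_id,)
    ).fetchone()

    # One shared product index with base prices, queried via the standard select API.
    resp = requests.get(
        "http://localhost:8983/solr/products/select",
        params={
            "q": "manufacturer:Samsonite AND category:bags",
            "sort": "base_price asc",
            "wt": "json",
        },
    )
    for doc in resp.json()["response"]["docs"]:
        yield doc["id"], doc["base_price"] * multiplier

Note that a per-client multiplier preserves ordering, so sorting by base_price in Solr already yields the lowest client price first.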

How do database indices make search faster

I was reading through the Rails tutorial (http://ruby.railstutorial.org/book/ruby-on-rails-tutorial#sidebar-database_indices) but was confused by the explanation of database indices. Basically, the author proposes that rather than searching in O(n) time through a list of emails (for login), it's much faster to create an index, giving the following example:
To understand a database index, it’s helpful to consider the analogy
of a book index. In a book, to find all the occurrences of a given
string, say “foobar”, you would have to scan each page for “foobar”.
With a book index, on the other hand, you can just look up “foobar” in
the index to see all the pages containing “foobar”.
source:
http://ruby.railstutorial.org/chapters/modeling-users#sidebar:database_indices
So what I understand from that example is that words can be repeated in text, so the "index page" consists of unique entries. However, on the railstutorial site, the login is set up so that each email address is unique to an account, so how does having an index make things faster when we can have at most one occurrence of each email?
Thanks
Indexing isn't (much) about duplicates. It's about order.
When you do a search, you want to have some kind of order that lets you (for example) do a binary search to find the data in logarithmic time instead of searching through every record to find the one(s) you care about (that's not the only type of index, but it's probably the most common).
Unfortunately, you can only arrange the records themselves in a single order.
An index contains just the data (or a subset of it) that you're going to search on, plus pointers (of some sort) to the records containing the actual data. This allows you to (for example) do searches based on as many different fields as you care about, and still be able to do binary searching on all of them, because each index is arranged in order by its field.
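A toy Python illustration of the idea (the "table" is unordered; the "index" is the keys kept in sorted order plus pointers back to the rows):

import bisect

rows = [  # unordered table: (row id, email)
    (0, "zoe@example.com"),
    (1, "amy@example.com"),
    (2, "bob@example.com"),
]

# The index: (email, row id) pairs kept in sorted order.
index = sorted((email, rid) for rid, email in rows)
emails = [email for email, _ in index]

def find(email):
    i = bisect.bisect_left(emails, email)  # binary search: O(log n)
    if i < len(emails) and emails[i] == email:
        return rows[index[i][1]]
    return None

print(find("bob@example.com"))  # (2, 'bob@example.com')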
Because the index in the DB, as in the given example, is sorted alphabetically, while the raw table / book is not. Then think: how do you search an index knowing it is sorted? I guess you don't start reading at "A" and continue up to the point of interest. Instead you skip roughly to the point of interest and start searching from there. Basically, a DB can do the same with an index.
It is faster because the index contains only values from the column in question, so it is spread across a smaller number of pages than the full table. Also, indexes usually include additional optimizations such as hash tables to limit the number of reads required.

Resources