How best to map this structure onto Amazon SimpleDB

So SimpleDB has a kind of spreadsheet data model.
I have an app that simply needs to store keys against values, except that a single key can have multiple values.
There will be multiple clients. Each client has an id and its own set of keys.
I'd like to stick with a single domain if I can at this stage.
How can I map this onto SimpleDB?
I was thinking
domain = mydomain
item = clientid
attribute.n.name = key_1 ... key_n
attribute.n.value = val1 ... valn
That would satisfy the ability to store multiple values for the same key.
But then I found that I need to either fetch ALL attributes in my select or know exactly
how many attributes I have. I will not know this up front.
I also allow deleting a specific value from a key (or attribute), so I would have to search for it first. It seems that select has no attributeName() function, just the itemName() function.
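For concreteness, here's roughly what that layout looks like through the AWS SDK for Java v1 (the domain, client id, and key names are made up). One thing I noticed while sketching it: SimpleDB attributes are natively multi-valued, and DeleteAttributes seems to accept a specific name/value pair, which would cover deleting one value from a key:

import java.util.Arrays;
import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.DeleteAttributesRequest;
import com.amazonaws.services.simpledb.model.PutAttributesRequest;
import com.amazonaws.services.simpledb.model.ReplaceableAttribute;

AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();

// Store two values under the same key for client "myid".
// Attributes are multi-valued, so no _n suffix is needed.
sdb.putAttributes(new PutAttributesRequest("mydomain", "myid", Arrays.asList(
        new ReplaceableAttribute("boots", "red", false),
        new ReplaceableAttribute("boots", "blue", false))));

// Delete just the value "red" from the key "boots".
sdb.deleteAttributes(new DeleteAttributesRequest("mydomain", "myid",
        Arrays.asList(new Attribute().withName("boots").withValue("red"))));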
Would it perhaps be better to make the item name a combination of id + key + _n ?
e.g. if the id is 'myid' and the key is 'boots' then the item name would be
'myidboots_1'
And then have a single attribute per item called say 'keyval'.
and I can do a
select keyval from mydomain where itemName() like 'myidboots_%' ?
Still kind of cumbersome compared to a normal SQL database.
Maybe I should try encoding the values as a comma-separated list?
Except that it's probably even more cumbersome, and I've read that attribute values are limited to 1,024 bytes.
Any other suggestions?

I'm not sure I totally follow your question, but I think it might be helpful to point out that SimpleDB lets you do classic SQL style queries like:
select * from foo where bar = '1'
This will return all the attributes/values for the resulting records.
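For example, with the official Java client the same style of query looks roughly like this (the domain and attribute names are placeholders):

import com.amazonaws.services.simpledb.AmazonSimpleDB;
import com.amazonaws.services.simpledb.AmazonSimpleDBClientBuilder;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.Item;
import com.amazonaws.services.simpledb.model.SelectRequest;

AmazonSimpleDB sdb = AmazonSimpleDBClientBuilder.defaultClient();
// Every attribute/value pair of each matching item comes back with the result.
for (Item item : sdb.select(new SelectRequest("select * from foo where bar = '1'")).getItems()) {
    for (Attribute a : item.getAttributes()) {
        System.out.println(item.getName() + ": " + a.getName() + " = " + a.getValue());
    }
}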

Related

SSRS How to pass MANY multi-value parms to URL

We have a report that has 4 multi-value parameters. Each value is 20 characters, and some parms have > 30 choices. If a user selects All, passing the parameters as &dept=10&dept=20, etc. won't work because it will exceed the 2,048-character URL limit. Is there another way to pass them? Here is what part of the URL looks like with just one value for each parm except Department:
... &p_IncidentStartDate=2015-01-01&p_IncidentEndDate=2016-02-25&p_DepartmentId=QDP00000000000000041&p_DepartmentId=QDP00000000000000008&p_DepartmentId=QDP00000000000000011&p_TouchpointId=QSE00000000000000075&p_GeneralIssueId=DSE00000000000000021&p_SpecificComplaintId=DSE00000000000000054&p_IsSubmitted=1&p_formGiftCardStatus=Pending
The stored procedure that is used in the dataset uses a "SplitString" function to separate the parameters on commas, so when I run the proc manually I pass the parms as a single comma-separated string.
Is there some other way to pass the parameters other than one value at a time? (I've read lots of posts but can't find an answer that works.)
For one, you could have a manual option to select called "All" and change your query's logic to look something like this:
SELECT *
FROM TABLE
WHERE (p_DepartmentId IN (@p_DepartmentId) OR @p_DepartmentId = 'All')
I assume p_DepartmentId is the PK? It's unfortunate it's so long. There are ways to pass multi-value parameters through the URL, but I think that involves a lot of extra work. It might be easier to create a custom "All" entry like I've shown above: just bake it into the query that returns the dataset for the departments, like this.
SELECT 'All' AS p_DepartmentId
UNION
SELECT p_DepartmentId FROM TABLE

Mongodb: Is it a good idea to create a unique index on web URLs?

My document looks like:
{"url": "http://some-random-url.com/path/to/article"
"likes": 10
}
The url needs to be unique. Is it a good idea to have a unique index on the url? The URL can be long, resulting in a larger index, a bigger memory footprint, and slower overall performance. Would it be better to generate a hash from the url (I am thinking about using murmur3) and create a unique index on that instead? I am assuming that the chances of collision are pretty low, as described here: https://softwareengineering.stackexchange.com/questions/49550/which-hashing-algorithm-is-best-for-uniqueness-and-speed
Does anyone see any drawbacks to this approach? The new document will look like (with a unique index on u_hash instead of url):
{"url": "http://some-random-url.com/path/to/article"
"likes": 10
"u_hash": "<murmur3 hash of url>"
}
UPDATE
I will not be doing regex queries on the url; I will only do complete-URL lookups. I am more concerned about the performance of this lookup, as I believe it will also be used internally by MongoDB to maintain the unique index, and hence affect write performance as well (plus a longer index). Additionally, my understanding is that MongoDB doesn't perform well with long text indexes, as it wasn't designed for that purpose. I may be wrong though, and it may only depend on whether or not that index fits into RAM. Any pointers?
I'd like to expand on the answer of @AlexRyan. While he is right in general, there are some things which need to be taken into consideration for this use case.
First of all, we have to differentiate between a unique index and the _id field.
When the URL needs to be unique in your use case, there has to be a unique index. What we have to decide is whether to use the URL itself or a hashed value of it. The hashing itself would not help with the search, as the hash saved in a field would be treated as just another string by MongoDB. It may save space (hash values may be shorter than the URLs), thereby reducing the memory needed for the index. However, doing so takes away the possibility of searching for parts of the URL via the index, for example with
db.collection.find({url:{$regex:/stackoverflow/}})
With a unique index on url, this query would use an index, which will be quite fast. Without such (unique) index, this query will result in a comparably slow collection scan.
Plus, creating the hash each and every time before querying, updating or inserting doesn't make these operations any faster.
This leaves us with the fact that creating a hash and a unique index on it may save some RAM at the cost of making queries on the actual field slower by orders of magnitude, and it introduces the need to compute a hash each and every time. Having an index on both the URL and its hashed value would not make sense at all.
Now to the question of whether it is a good idea to use the URL as _id one way or the other. Since URLs usually are distinct by nature (a given URL is supposed to return the same content every time) and the likes are tied to that uniqueness, I would tend to use the URL as the id. Since you need the unique index on _id anyway, it serves two purposes here: you have your id for the document, you ensure uniqueness of the URL and - in case you use the natural representation of the URL - it will even be queryable in an efficient way.
Use a unique index on url
db.interwebs.ensureIndex({ "url" : 1}, { "unique" : 1 })
and not a hashed index. Hashed indexes in MongoDB are meant to be used for hashed shard keys and not for unique constraints. From the hashed index docs,
Hashed indexes support sharding a collection using a hashed shard key. Using a hashed shard key to shard a collection ensures a more even distribution of data.
and
You may not create compound indexes that have hashed index fields or specify a unique constraint on a hashed index
If url needs to be unique and you will use it to look up documents, it's absolutely worth having a unique index on url. If you want to use url as the primary key for documents, you can store the url value in the _id field. This field is normally a driver-generated ObjectId but it can be any value you like. There's always a unique index on _id in a MongoDB collection so you get the unique index "for free".
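As a rough sketch with the MongoDB Java driver (the database and collection names are made up), the two options look like this:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

MongoClient client = MongoClients.create("mongodb://localhost:27017");
MongoCollection<Document> coll = client.getDatabase("test").getCollection("interwebs");

// Option 1: keep the driver-generated _id and add a unique index on url.
coll.createIndex(Indexes.ascending("url"), new IndexOptions().unique(true));

// Option 2: use the URL itself as _id; the unique index on _id comes for free.
coll.insertOne(new Document("_id", "http://some-random-url.com/path/to/article")
        .append("likes", 10));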
I think the answer is "it depends".
Choosing keys that have no real world meaning embedded in them may save you pain in the future. This is especially true if you decide you need to change it but you have a lot of foreign keys referencing it.
Most database management systems offer you a way to generate unique IDs.
In Oracle, you might use a sequence.
In MySQL you might use AUTO_INCREMENT when you define the table itself.
The way that MongoDB assigns unique ids to documents is different from relational databases: it uses ObjectIds for this purpose.
One of the interesting things about ObjectIDs is that they are generated by the driver.
Because of the algorithm that is used to generate them, they are guaranteed to be unique even if you have a large cluster of app and database servers.
You can learn more about them here:
http://docs.mongodb.org/manual/reference/object-id/
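A quick illustration with the Java driver (the hex value shown is just an example):

import org.bson.types.ObjectId;

ObjectId id = new ObjectId();          // generated client-side by the driver
System.out.println(id.toHexString());  // e.g. "507f1f77bcf86cd799439011"
System.out.println(id.getDate());      // the creation timestamp embedded in the id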
A lot of engineering work has gone into ensuring that ObjectIds are unique.
I use them by default unless there is a really good reason not to.
So far, I have not found a really good reason to not use them.

Assign Key Field Value Only If Corresponding Lookup Result Value Exists

I have ten master tables and one transaction table. In my transaction table (a memory table, just like a ClientDataSet) there are ten lookup fields pointing to my ten master tables.
Now I am trying to dynamically assign key field values to all the lookup key fields of the transaction table from a different server (the data comes in as SOAP XML). Before assigning these values I need to check whether the corresponding result value is valid in the master tables. I am using a filter (e.g. status = 1) to check whether it is valid or not.
Currently, before assigning each key field value, we filter the master table with this filter and use Locate to check whether the value is there; if it is located, we assign its key field value.
This works fine if there are only a few records in my master tables. But consider my master tables having fifty thousand records each (yes, the customer has that much data); this leads to a big performance issue.
Could you please help me to handle this situation.
Thanks
Basil
The only way to know if it is slow, why, where, and what solution works best is to profile.
Don't make a priori assumptions.
That being said, minimizing round trips to the server and the amount of data transferred is often a good thing to try.
For instance, if your master tables are on the server (not 100% clear from your question), sending only one query (or stored proc call), passing all the values to check at once as parameters, doing a bunch of "IF EXISTS..." checks, and returning all the answers at once (either as output params or a one-record dataset) would be a good start.
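A sketch of that idea in Java/JDBC terms (the table and column names are invented; the Delphi equivalent with a parameterized query is analogous): send the whole candidate list in one round trip and get back the set of valid values.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

List<String> candidates = Arrays.asList("K1", "K2", "K3"); // values from the SOAP XML
String placeholders = String.join(",", Collections.nCopies(candidates.size(), "?"));
String sql = "SELECT key_field FROM master_table WHERE status = 1 AND key_field IN (" + placeholders + ")";

Set<String> valid = new HashSet<>();
try (Connection con = DriverManager.getConnection("jdbc:...");  // connection string elided
     PreparedStatement ps = con.prepareStatement(sql)) {
    for (int i = 0; i < candidates.size(); i++) {
        ps.setString(i + 1, candidates.get(i));
    }
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            valid.add(rs.getString(1)); // every candidate that exists with status = 1
        }
    }
}
// Assign only those lookup key field values that are present in 'valid'.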
And 50,000 records is not much, so, as I said initially, you may not even have a performance problem. Check it first!

How to order the data back from Amazon SimpleDB in a specific column order

I'm using Amazon's SimpleDB Java client to read data from SimpleDB. The problem I have is that even though I specified the columns in a particular order in the SelectRequest, like the following:
SelectRequest req = new SelectRequest("SELECT TIMESTAMP, TYPE, APP, http_status, USER_ID from mydata");
SelectResult res = _sdb.select(req);
..
It returned the data in the following column order:
APP, TIMESTAMP, TYPE, USER_ID, http_status
It seems it automatically reordered the columns in ascending order. Is there any way I can force the order I specified in the select clause?
The columns returned are not an ordered list but an unordered set of attributes. You can't control the order they come back in. SELECT is designed to work even in cases where some of the attributes in your query don't exist for every (or any) returned items. In those cases specifically you wouldn't be able to rely on order anyway. I realize that's small consolation if you have structured your data set so that the attributes are always present.
However, since you know the desired order ahead of time, it should be pretty easy to pull the data out of the result in the proper order. It's just XML after all, or in the case of the Java client, freshly parsed XML.
The Select operation returns a set of Attributes for ItemNames that match the select expression.
SimpleDB docs for SELECT
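Since you know the order you want, something like this pulls the values out in your own order (a rough sketch against the Java client; it assumes each attribute appears at most once per item):

import java.util.HashMap;
import java.util.Map;
import com.amazonaws.services.simpledb.model.Attribute;
import com.amazonaws.services.simpledb.model.Item;

String[] desiredOrder = {"TIMESTAMP", "TYPE", "APP", "http_status", "USER_ID"};
for (Item item : res.getItems()) {
    // Index the unordered attribute set by name, then read it back in the desired order.
    Map<String, String> byName = new HashMap<>();
    for (Attribute a : item.getAttributes()) {
        byName.put(a.getName(), a.getValue());
    }
    StringBuilder row = new StringBuilder(item.getName());
    for (String col : desiredOrder) {
        row.append('\t').append(byName.getOrDefault(col, ""));
    }
    System.out.println(row);
}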

Fetch data from multiple tables and sort all by their time

I'm creating a history page, so I was wondering: is there any way to fetch all rows from multiple tables and then sort them by their time? Every table has a field called "created_at".
So is there any way to fetch from all the tables and sort them without having Rails do the sorting for me?
You may get a better answer, but I would presume you would need to:
1. Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose (e.g. Name, Description).
2. Modify all tables that generate a "history" item to consume this new table via a foreign key relationship on History.Id.
"Mashing up" tables [ie merging different result sets into a single result set] is a very difficult problem, but you would effectively be doing the above anyway - just in the application layer, so why not do it correctly and more efficiently in the data layer.
Hope this helps :)
You would need to run SQL like:
SELECT * FROM table ORDER BY created_at ASC
Store the results in an array. Do this for each of the data sources, then perform a merge sort on all the arrays in Ruby. This will work well for small data sets, but once you get a data set that is large (i.e. greater than will fit into memory) you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.
Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set.
On different machines: you'll want to test this for performance, but you could use linked servers with UNION and ORDER BY. Alternatively, you could have Ruby get the results from each db, then combine and sort them.
EDIT: From your last comment about different tables rather than different DBs, use something like this:
SELECT Created FROM table1
UNION
SELECT Created FROM table2
ORDER BY Created
