Being somewhat new to search engines, the notions of indexes and types are not very clear to me. Elasticsearch has the notion of indexes and types where you can store a document.
Does the notion of an index correlate with a schema in a database?
And does the notion of a type correlate with a table?
Can someone please explain the purpose of having another grouping below indexes?
Why can't we store all documents of the same type on a single index?
Does the notion of an index correlate with a schema in a database? And does the notion of a type correlate with a table?
No and no. First, ElasticSearch is schema-free: you don't have to specify the structure of your documents upfront. Just throw some JSON at ElasticSearch and it will happily index it, store it, retrieve it, and search it.
The concept of an index correlates to the notion of a database: a database contains many tables, i.e. heterogeneously structured data.
The notion of a type correlates to the notion of a table: various types stored under one index can have different mappings, i.e. different analyzers for fields, etc.
Another way to look at types is as column families in column-oriented databases such as HBase or Cassandra.
There is actually a very nice example in the ElasticSearch README: storing two different types of data (users and their tweets) in one index, named “twitter”.
(All that said, nobody forces you to exploit this feature: you can have one type under an index, if it makes sense for you.)
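To make the README's users-and-tweets example concrete, here is a rough sketch using the elasticsearch-ruby client (the document contents mirror the README; the calls assume a pre-7.x gem and cluster, since later versions removed mapping types):

```ruby
require 'elasticsearch' # gem 'elasticsearch'

client = Elasticsearch::Client.new(url: 'http://localhost:9200')

# Two differently structured document types, one "twitter" index.
# (Pre-7.x only: mapping types were removed in later releases.)
client.index index: 'twitter', type: 'user', id: 'kimchy',
             body: { name: 'Shay Banon' }

client.index index: 'twitter', type: 'tweet', id: 1,
             body: { user: 'kimchy',
                     post_date: '2009-11-15T14:12:12',
                     message: 'Trying out Elasticsearch' }

# Search only the tweets, ignoring the user documents in the same index.
results = client.search index: 'twitter', type: 'tweet',
                        body: { query: { match: { user: 'kimchy' } } }
results['hits']['hits'].each { |hit| puts hit['_source']['message'] }
```

Both structures live side by side in the one "twitter" index, yet the tweets can be queried on their own simply by passing the type.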
I am quite a newbie to Cube.js. I have been trying to integrate Cube.js analytics functionality with my Ruby on Rails app. The database is PostgreSQL. In the database, there is a column called answers_json with the jsonb data type, which contains a nested hash. An example of the data in that column is:
**answers_json:**
"question_weights_calc"=>
{"314"=>{"329"=>1.5, "331"=>4.5, "332"=>1.5, "333"=>3.0},
"315"=>{"334"=>1.5, "335"=>4.5, "336"=>1.5, "337"=>3.0},
"316"=>{"338"=>1.5, "339"=>3.0}}
There are many more keys in the same column with the same hash structure as shown above. I posted only the specific part I will be dealing with. I need assistance with accessing the values in this nested hash. In the example above, the keys "314", "315" and "316" are Category IDs. The keys associated with Category ID "314" are "329", "331", "332" and "333", which are Question IDs. Each category has multiple questions, and for different records the category and question IDs will be different. I need to access the values associated with each Question ID key. For example, to access the value 1.5 I need to do this in my schema file:
**sql: `(answers_json -> 'question_weights_calc' -> '314' ->> '329')`**
But the issue here is that those IDs will be dynamic for different records in the database. Instead of "314" and "329", they can be some other numbers. Adding a different record's JSON here for clarification:
**answers_json:**
"question_weights_calc"=>{"129"=>{"273"=>6.0, "275"=>15.0, "277"=>8.0}, "252"=>{"279"=>3.0, "281"=>8.0, "283"=>3.0}}}
How can I discover and access those dynamic IDs and their values, given that I also need to perform mathematical operations on the values? Thanks!
As a general rule, it's difficult to run SQL-based reporting on highly dynamic JSON data. Postgres does have some useful functions for dealing with JSON, and you might be able to get there with json_each or json_object_keys (or their jsonb_each / jsonb_object_keys equivalents for a jsonb column) plus a few joins, but it's quite likely that the performance and maintainability of such a query would be difficult, to say the least 😅 Cube.js ultimately executes SQL queries, so if you do go that route, the query should be easily transferable to a Cube.js schema.
Another approach would be to create a separate data processing pipeline that collects all the JSON data and flattens it into a single table, then stores that table back in your database of choice, from where you could use Cube.js to query it.
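A minimal sketch of that flattening step in plain Rails, assuming a hypothetical Response model that owns the answers_json column and a pre-created question_weights table (response_id, category_id, question_id, weight):

```ruby
# Response and QuestionWeight are hypothetical models; adjust the names
# to whatever actually owns answers_json in your app.
class FlattenQuestionWeights
  def call
    Response.find_each do |response|
      weights = response.answers_json.fetch('question_weights_calc', {})

      weights.each do |category_id, questions|   # e.g. "314" => { "329" => 1.5, ... }
        questions.each do |question_id, weight|
          QuestionWeight.create!(
            response_id: response.id,
            category_id: category_id.to_i,
            question_id: question_id.to_i,
            weight:      weight
          )
        end
      end
    end
  end
end

FlattenQuestionWeights.new.call
```

Once the weights sit in a flat table, the Cube.js schema can point at question_weights directly and aggregate the weight column without knowing any of the category or question IDs in advance.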
I have set up a time-series / events database using the AWS Firehose -> S3/Glue -> Athena stack. It is being used to track various user actions (session started, action performed, etc.) across a number of our products. My question is about how best to store different types of IDs in this system.
The existing schema is one big 'fact table' with a bunch of different columns. Two of the most important columns are event_type_id and object_id. To use StackOverflow as an example, two events might be:
question_asked - in this case I would be storing the question id in the object_id column.
tag_created - in this case I would be storing the tag id in the object_id column.
My question is - is storing multiple different types of IDs in the same column bad practice? It's working OK for us at the moment, but it does require the person/system performing queries to know what type of object the object_id column refers to, based on the event they are querying.
If bad practice, what other approaches might be better? Multiple columns where they are NULL if not relevant for the event in that row? Or is this where dimension tables would be a better fit?
This isn't necessarily bad practice, depending on how you use it.
It sounds like you're aware of the potential pitfalls of such an approach (i.e. users of the data have to be aware of the context, in this case the event type, to use the values correctly). Since you're using Athena, you could mitigate that by creating views over the source table for different event types, adding a WHERE clause filter on event type and possibly renaming object_id to something more context-specific, e.g. question_id.
This makes it easier for users to work with the data and understand exactly what values they're working with.
In a big data environment I wouldn't recommend creating dimension tables if it can be avoided, as JOINs between tables start to get expensive. Having multiple columns for different IDs is possible, but then you create new problems for users, such as having to account for NULL values in an ID column, and it also potentially makes it harder to add new event types and IDs, since you have to change the schema to accommodate them.
I have an app that consists mainly of restaurant model instances. One of the essential attributes for these restaurants is labeling the cuisine each falls under. I'm currently at odds with myself in regard to designing this. On one hand, I thought of creating a Cuisine model and creating either a HMT (has_many :through) or HABTM (has_and_belongs_to_many) association between Restaurants and Cuisines.
More recently I came across this post which shows how to create a pre-defined set of attributes. To take that answer one step further, I'm assuming (in my case) I'd add a string-based cuisine column to my restaurant model and set up a select box in my restaurant form that would save the selected value.
What I was wondering is: which would be the more efficient way of doing this? The goal is to eventually be able to query restaurants based on what cuisine(s) they fall under. I wasn't sure a model would be the best choice, since it would essentially serve as a join table with just a name attribute, and I wasn't sure whether having this extra table for something so minute would be optimal.
On the other hand, I didn't know if using YAML for this would be suitable either, since the values would essentially be dummy strings with no tangible records on file like I'd have with model instances. Can someone help me sort out this confusion?
There are many benefits of normalizing many-to-many relationships in the db. Here are some:
Searching, sorting, and creating indexes is faster, since tables are narrower, and more rows fit on a data page.
You can have more clustered indexes (one per table), so you get more flexibility in tuning queries.
Index searching is often faster, since indexes tend to be narrower and shorter.
More tables allow better use of segments to control physical placement of data.
You usually have fewer indexes per table, so data modification commands are faster.
Fewer null values and less redundant data, making your database more compact.
Triggers execute more quickly if you are not maintaining redundant data.
Data modification anomalies are reduced.
Normalization is conceptually cleaner and easier to maintain and change as your needs change.
Also, by normalizing you get the cleaner syntax and other infrastructure benefits from ActiveRecord, e.g.
cuisine.restaurants.where(city: 'Toledo')
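For illustration, here is a minimal sketch of the normalized setup using has_many :through; the CuisineAssignment join model and the name/city attributes are assumptions, and a plain has_and_belongs_to_many join table would work much the same way:

```ruby
class Restaurant < ActiveRecord::Base
  has_many :cuisine_assignments
  has_many :cuisines, through: :cuisine_assignments
end

class Cuisine < ActiveRecord::Base
  has_many :cuisine_assignments
  has_many :restaurants, through: :cuisine_assignments
end

# Join model sitting between restaurants and cuisines
# (hypothetical name; a HABTM join table works too).
class CuisineAssignment < ActiveRecord::Base
  belongs_to :restaurant
  belongs_to :cuisine
end

# Querying restaurants by cuisine stays in plain ActiveRecord:
Cuisine.find_by(name: 'Thai').restaurants.where(city: 'Toledo')
```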
I would like to know whether I am appropriately using table attributes to describe objects or whether there is a more efficient and advisable approach for certain types of attributes.
I have two ActiveRecord tables, foods and lists. Both tables have many columns because each object has many attributes (calories, fat, protein, etc.).
In addition to these intrinsic characteristics, I find myself adding columns to the table for attributes that represent an object’s membership in a group or user-defined properties.
Group membership data indicate whether a food is a dessert or a meat, among other categories. For this, I have columns with binary or categorical (char) data.
User-defined property data include “maximum calories” or “maximum fat” attributes for a list. If I have a column for “maximum” corresponding to each “total” (e.g., “maximum calories” and “total calories”), this of course doubles the number of columns.
Dessert and meat are intrinsic properties in that they cannot be altered by the user, but it seems they could be represented more efficiently by an array of food IDs or a hash. Having so many data points (and columns) to represent this simple categorization seems redundant, and it makes my tables very wide. The reason I have not switched to arrays for group membership data is that I like how this data is currently accessible from the object itself. It's intuitive, centralized, and seemingly less error-prone.
I don’t have an idea for how else I would manage user-defined “maximums” for lists, and maybe this proliferation of columns/attributes is the best option.
I would appreciate any advice or appraisal of my approach and suggestions of possible alternatives.
You can use serialize to store a text column in the database; when you select it, it is accessed as a Hash or Array again. This can solve the problem for you: instead of duplicating fields, store them as a Hash and then read from it.
Check out these:
Rails: Serializing objects in a database?
http://apidock.com/rails/ActiveRecord/Base/serialize/class
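A minimal sketch, assuming the lists table gets a text column called user_limits to hold the user-defined "maximum" values:

```ruby
class List < ActiveRecord::Base
  # user_limits is an assumed text column; serialize stores it as YAML
  # in the database but exposes it as a plain Hash in Ruby.
  serialize :user_limits, Hash
end

list = List.new(user_limits: { 'max_calories' => 2000, 'max_fat' => 70 })
list.save!

list.reload.user_limits['max_calories']  # => 2000
```

The trade-off is that a serialized column can't be filtered or indexed efficiently in SQL, so it suits values you only read back per record rather than ones you query across the whole table.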
I am currently building a Rails app where there is a "documents" data table that stores references to pdfs living on an S3 server. These documents could have 100 different types. Each type can have up to 20 attributes or meta info.
My dilemma is do I make 100 relational tables for every doc type or just create one key/value data table with a reference to the doc_id.
My gut tells me to go key/value for flexibility for searching and supporting more and more document types over time without having to create new migrations. However, I know there are pitfalls with this technique. My first concern of course is the size of the table. The key/value table could end up with millions of rows.
On the other hand, having 100 attribute tables would be a nightmare to query against in a full-text search situation.
So, bottom line: by going with key/value, is performance on a three-column Postgres table with potentially millions of rows a scaling problem? And what about joins on the value field?
This data would almost never change by the way. So it would be 90% reads.
Consider a single table with an hstore column. hstore is a PostgreSQL data type designed for storing key/value pairs.
http://www.postgresql.org/docs/9.1/static/hstore.html
There are also multiple Ruby gems that add hstore support to ActiveRecord. Here is one that I wrote: https://github.com/JackC/surus You can search ruby gems for about a dozen more alternatives as well.
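For a rough idea of what the hstore route looks like with plain ActiveRecord (Rails 4+ ships hstore support in the PostgreSQL adapter, so a gem is optional there; the column and key names below are placeholders):

```ruby
# Migration: enable the extension and add an indexed hstore column.
class AddPropertiesToDocuments < ActiveRecord::Migration
  def change
    enable_extension 'hstore'
    add_column :documents, :properties, :hstore
    add_index  :documents, :properties, using: :gin  # speeds up key/value lookups
  end
end

class Document < ActiveRecord::Base
end

# s3_path, doc_type and vendor are placeholder names.
Document.create!(
  s3_path:    'docs/report-001.pdf',
  properties: { 'doc_type' => 'invoice', 'vendor' => 'Acme' }
)

# Filter on a key/value pair stored inside the hstore column.
Document.where("properties -> 'doc_type' = ?", 'invoice')
```

With a GIN index on the hstore column, those key/value lookups stay fast as the table grows into millions of rows, which fits the read-heavy workload you describe.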