Find commonly joined queries in Redshift - join

I want to get a list of the most frequently joined tables in our Redshift. Ideally with the join conditions. Reason: we're adding sortkeys and distkeys, and trying to be relatively thorough (sidenote: if you have any good tips for optimizing query runtimes, I'm eager to hear).
I know I can query STL_QUERY to get the query text, runtimes, etc. But aside from doing some manual text analysis, is there any way to see which tables are joined, by query ID?

As far as I know, there is no STL table in Redshift that readily provides this information. As you mentioned, you would need to look at all the queries in the STL_QUERYTEXT table and search for joins.
In terms of general performance tuning suggestions, I would suggest you look at Periscope's blog if you haven't already. And there is this.
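If you do go the text-analysis route, here is a rough sketch of what it could look like in Python. It assumes you have already pulled the full SQL of each query into a list of strings (as far as I know, STL_QUERYTEXT stores the text in chunks ordered by a sequence column, so you would need to concatenate them per query first). The function name `join_stats` and the regexes are my own, not anything Redshift provides, and this is a heuristic, not a SQL parser: it will miss comma-style joins and will over-count tables that only appear together through subqueries.

```python
import re
from collections import Counter
from itertools import combinations

# Table name directly after FROM or JOIN (subqueries in parentheses are skipped).
TABLE_RE = re.compile(r"\b(?:from|join)\s+([a-z_][\w.]*)", re.IGNORECASE)

# Text of an ON clause, up to the next JOIN or a following clause keyword.
ON_RE = re.compile(
    r"\bjoin\s+[a-z_][\w.]*(?:\s+(?:as\s+)?\w+)?\s+on\s+(.+?)"
    r"(?=\s+(?:(?:inner|left|right|full|cross)\s+)?(?:outer\s+)?join\b"
    r"|\s+(?:where|group|order|limit|having)\b|;|$)",
    re.IGNORECASE | re.DOTALL,
)


def join_stats(queries):
    """Count table pairs that appear in the same query, and raw ON conditions."""
    pair_counts = Counter()
    condition_counts = Counter()
    for sql in queries:
        # Unique, lowercased table names so each pair counts once per query.
        tables = sorted({t.lower() for t in TABLE_RE.findall(sql)})
        pair_counts.update(combinations(tables, 2))
        for cond in ON_RE.findall(sql):
            # Normalize case and whitespace so identical conditions group together.
            condition_counts[" ".join(cond.lower().split())] += 1
    return pair_counts, condition_counts


# Hypothetical sample input; in practice this would come from the query log.
sample_queries = [
    "SELECT * FROM orders o JOIN users u ON o.user_id = u.id WHERE u.active",
    "select * from users u left join orders o on o.user_id = u.id",
    "SELECT count(*) FROM events",
]
pairs, conditions = join_stats(sample_queries)
print(pairs.most_common(5))
print(conditions.most_common(5))
```

The conditions are counted with aliases left in place, so the same join written with different aliases shows up as separate entries. Resolving aliases back to table names would need a real parser.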

Related

Graph databases for storing nested comments

I'm designing a data model for a nested comment system, like Reddit's.
I have read many blog posts, and every solution I found tries to build a hierarchical data structure in a relational database using designs such as adjacency list, path enumeration, closure table, or nested sets. Each has its own pros and cons, but they all feel like hacks because of the lack of SQL support. MongoDB seems like another good NoSQL option, though it is limited to 100 levels of nesting and a 16 MB document size.
The solution I'm looking for needs to be fast at reads (50 RPS); slow inserts and deletes are OK. I expect to filter and sort comments by ranking.
Can I use a graph database such as Neo4j or AWS Neptune for this requirement? Would it be suitable, or is it over-engineering?
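For comparison, here is a minimal sketch of the adjacency-list option mentioned above, with the nesting done in application code: read every comment of a thread as flat rows, then assemble the tree in memory and sort siblings by score. The row layout `(comment_id, parent_id, score, text)` and the function name `build_thread` are assumptions for illustration, not from any particular library or schema.

```python
from collections import defaultdict


def build_thread(rows):
    """Build a nested comment tree from flat adjacency-list rows.

    rows: iterable of (comment_id, parent_id, score, text);
    parent_id is None for top-level comments.
    """
    children = defaultdict(list)
    for comment_id, parent_id, score, text in rows:
        children[parent_id].append(
            {"id": comment_id, "score": score, "text": text, "replies": []}
        )

    def attach(parent_id):
        # Highest-scored siblings first, then recurse into each reply chain.
        nodes = sorted(
            children.get(parent_id, []), key=lambda n: n["score"], reverse=True
        )
        for node in nodes:
            node["replies"] = attach(node["id"])
        return nodes

    return attach(None)


# Hypothetical rows, as they might come back from one SELECT per thread.
rows = [
    (1, None, 5, "first"),
    (2, 1, 1, "reply to first"),
    (3, 1, 7, "better reply to first"),
    (4, None, 9, "top comment"),
    (5, 3, 2, "nested reply"),
]
thread = build_thread(rows)
print([c["id"] for c in thread])
```

This uses recursion, so extremely deep threads could hit Python's recursion limit; an explicit stack would avoid that.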

ActiveRecord Views vs Tables (in Rails 4)

I'm only just starting to learn about views in ActiveRecord from reading a few blog posts and some tutorials on how to set them up in Rails.
What I would like to know is what are some of the pros and cons of using a View instead of a query on existing ActiveRecord tables? Are there real, measurable performance benefits of using a view?
For example, I have a standard merchant application with orders, line items, and products. For an admin dashboard, I have various queries, many of which build on one query that I reuse a lot in the code, namely one that returns user_id, order_id and total_revenue for each order. From a business perspective, many good stats are based off that core query. At what point does it make sense to switch to a view instead?
The ActiveRecord docs on Views are also a bit sparse so any references to some good resources on both the why and how would be greatly appreciated.
Clarification: I am not talking about HTML views but SQL database views. Essentially, are the performance wins maintained in an ActiveRecord implementation? Since the docs are sparse, are there any potential obstacles to those performance wins that could be lost if you implement them incorrectly (i.e., any non-obvious gotchas)?
I got this information from another developer off-line, so I am answering my own question here in case other people stumble upon it.
Basically, starting in Rails 3.1, ActiveRecord began using prepared statements, which preprocess and cache SQL statement patterns ahead of time and so allow faster query responses later. You can read more about it in this blog post.
As a result, switching to views in PostgreSQL may not bring much benefit, since views may not perform much better than prepared statements there.
The PostgreSQL documentation on prepared statements seems clear and well-written. More thoughts on PostgreSQL views performance can be found in this stackoverflow post.
Additionally, it's probably much more likely that your Rails app has performance issues due to N+1 queries - this is a great post that explains the problem and one of the easiest ways to prevent it with eager loading.
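To make the N+1 problem concrete, here is a small sketch using Python's built-in sqlite3 module rather than Rails. The schema, the `run` helper, and the counter are hypothetical and only there to count statements. The second half mimics what eager loading typically does, which is to fetch all children in one extra query instead of one query per parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY);
    CREATE TABLE line_items (id INTEGER PRIMARY KEY, order_id INTEGER, name TEXT);
    INSERT INTO orders (id) VALUES (1), (2), (3);
    INSERT INTO line_items (order_id, name) VALUES
        (1, 'pen'), (1, 'ink'), (2, 'paper'), (3, 'stamp');
""")

counter = {"queries": 0}


def run(sql, params=()):
    """Execute a statement and count it, so the two approaches can be compared."""
    counter["queries"] += 1
    return conn.execute(sql, params).fetchall()


# N+1: one query for the orders, then one more query per order.
orders = run("SELECT id FROM orders ORDER BY id")
lazy = {
    oid: [name for (name,) in run(
        "SELECT name FROM line_items WHERE order_id = ? ORDER BY id", (oid,))]
    for (oid,) in orders
}
n_plus_one_queries = counter["queries"]

# Eager loading: one query for the orders, one for all of their line items.
counter["queries"] = 0
orders = run("SELECT id FROM orders ORDER BY id")
ids = [oid for (oid,) in orders]
placeholders = ",".join("?" * len(ids))
eager = {oid: [] for oid in ids}
for order_id, name in run(
    f"SELECT order_id, name FROM line_items "
    f"WHERE order_id IN ({placeholders}) ORDER BY id", ids
):
    eager[order_id].append(name)
eager_queries = counter["queries"]

print(n_plus_one_queries, eager_queries)
```

Both approaches end up with the same data; the difference is that the first one issues a number of statements that grows with the number of orders.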
This question is not directly related to ActiveRecord; it seems more like a database question to me. The answer is "it depends". Many times people use views because they want to:
represent a subset of the data contained in a table
simplify queries by joining multiple tables into a virtual table represented by the view
do aggregation in the view
hide the complexity of the data
restrict access for security reasons
etc.
But most of the aforementioned features can be implemented using raw tables. It's just a little more complicated than using views. Another place you may consider using a view is for performance reasons. That is the materialized view (or indexed view in SQL Server). Basically, it saves a copy of your data in the form you want, which can greatly boost performance.
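As a sketch of the "aggregation in the view" case, here is the core query from the question (user_id, order_id, total_revenue) wrapped in a plain view. It uses SQLite through Python's sqlite3 module instead of PostgreSQL so that it runs anywhere; the table and column names are assumptions based on the question. Note that a plain view like this stores nothing: the underlying query runs each time the view is read, so the gain is reuse and readability, not speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, user_id INTEGER NOT NULL);
    CREATE TABLE line_items (
        id INTEGER PRIMARY KEY,
        order_id INTEGER NOT NULL REFERENCES orders(id),
        unit_price REAL NOT NULL,
        quantity INTEGER NOT NULL
    );

    -- The reusable core query, defined once as a view.
    CREATE VIEW order_revenue AS
        SELECT o.user_id,
               o.id AS order_id,
               SUM(li.unit_price * li.quantity) AS total_revenue
        FROM orders o
        JOIN line_items li ON li.order_id = o.id
        GROUP BY o.user_id, o.id;

    INSERT INTO orders (id, user_id) VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO line_items (order_id, unit_price, quantity) VALUES
        (1, 5.0, 2), (1, 1.5, 4), (2, 20.0, 1), (3, 3.0, 3);
""")

# Dashboard queries can now treat the view like a table.
per_order = conn.execute(
    "SELECT user_id, order_id, total_revenue FROM order_revenue ORDER BY order_id"
).fetchall()
per_user = conn.execute(
    "SELECT user_id, SUM(total_revenue) FROM order_revenue "
    "GROUP BY user_id ORDER BY user_id"
).fetchall()
print(per_order)
print(per_user)
```

A materialized view would differ in that the result is stored and has to be refreshed when the underlying tables change.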

Modifying Cypher Query Engine

I would like to modify the way Cypher processes queries sent to it for pattern matching. I have read about execution plans and how Cypher chooses the best plan with the least number of operations, which is pretty good. However, I am looking into implementing a similarity search feature that lets you specify a query graph and returns matches that are close (similar) when there is no exact match. I have seen a few examples of this in theory, and I would like to implement something of this sort for Neo4j. I am guessing that would require a change in how the query engine deals with queries sent to it. Or worse :)
Here are some links that demonstrate the idea
http://www.cs.cmu.edu/~dchau/graphite/graphite.pdf
http://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper72.pdf
I am looking for ideas. Anything at all in relation to the topic would be helpful. Thanks in advance
(:I)<-[:NEEDING_HELP_FROM]-(:YOU)
From my point of view, you are better off creating an unmanaged extension, because it lets you add your own custom functionality to the Neo4j server.
You cannot extend the Cypher language itself without maintaining your own fork of the source code.

Is it wise to use Google Tables as a Database?

I just found out about the Unhosted Movement.
I understand the points being made about its advantages over classical web app approaches that include a database, whether SQL or NoSQL.
From my point of view there are concerns regarding security and privacy. I believe the disadvantages outweigh the advantages, especially if sensitive data is involved.
I would love to hear about more pros/cons and experiences from you guys. Personally I would rather use Laravel/RoR or a similar Framework with scaffolding etc.
I'm about to try that. As far as security/privacy is concerned, you can grant limited access to tables and use ssl. Google still knows everything of course.
But Fusion Tables isn't a full-blown database after all. Its SQL is highly limited: you have no joins in SELECT, no GROUP BY in views and only left outer joins, no subqueries, no EXISTS clause, and no users, transactions, locking, or isolation levels, which might be the reasons to use a database in the first place. It is also not meant to be one. There are no standard connectors that I'm aware of, so you'll have to use the API. The last post asking for a JDBC driver is some years old, and there still isn't one.

When to not use neo4j?

Neo4j is a great tool for mapping relational data, but I am curious under what conditions it would not be a good tool to use.
In which use cases would using neo4j be a bad idea?
You might want to check out this slide deck and in particular slides 18-22.
Your question could have a lot of details to it, but let me try to focus on the big pieces. Graph databases are naturally indexed by relationships, so they will be good when you need to traverse a lot of relationships. Graphs themselves are very flexible, so they'll be good when the interconnections between your data need to change from time to time, or when the data you need to store about your core objects changes. Graphs are a very natural method of modeling some (but not all) data sources, such as peer-to-peer networks, road maps, and organizational structures.
Graphs tend not to be good at managing huge lists of things. For example, if you were going to build a customer transaction database with analytics (where you have 1 million customers, 50 million transactions, and all you do is post transactions all day long), then it's probably not a good fit. An RDBMS is great at that. Notice how that use case doesn't really exploit relationships.
Make sure to read those two links I provided, they have much more discussion.
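To illustrate the traversal-heavy case described above, here is a toy sketch in Python. A plain dict of neighbor lists stands in for the graph store, which is an assumption for illustration and not how Neo4j is implemented; the function name `within_hops` is mine. The point is only that each hop follows stored neighbors directly, rather than matching rows across tables.

```python
from collections import deque


def within_hops(graph, start, max_hops):
    """Return {node: hop_count} for nodes reachable from start in at most max_hops."""
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = seen[node] + 1
                queue.append(neighbor)
    del seen[start]
    return seen


# Hypothetical "knows" relationships.
graph = {
    "ann": ["bob", "cat"],
    "bob": ["dan"],
    "cat": ["dan"],
    "dan": ["eve"],
}
reachable = within_hops(graph, "ann", 2)
print(reachable)
```

The transaction-posting workload from the counterexample never needs a loop like this, which is one way to see why it does not benefit from a graph model.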
For maintenance reasons, any service aggregating data feeds has until now been well advised to keep its sources independent.
If I want to explore relationships between different feeds, this can be done at the application level, using data that tracks (for example) user preferences among the other feeds.
Graph databases are about managing relationship complexity, but this complexity is in many cases a design choice. Putting all your kids in one bathtub is fine until you drop the soap.
