Fetch data from multiple tables and sort all by their time - ruby-on-rails

I'm creating a history page. I was wondering if there is any way to fetch all rows from multiple tables and then sort them by time? Every table has a field called "created_at".
So is there any way to fetch from all the tables and sort in the database, without having Rails do the sorting for me?

You may get a better answer, but I would presume you would need to:
Create a History table with a Created date column, an autogenerated Id column, and any other contents you would like to expose (e.g. Name, Description).
Modify all tables that generate a "history" item to consume this new table via a foreign key relationship on History.Id.
"Mashing up" tables (i.e. merging different result sets into a single result set) is a very difficult problem, and you would effectively be doing the above anyway - just in the application layer - so why not do it correctly and more efficiently in the data layer?
Hope this helps :)

You would need to perform SQL like:
SELECT * FROM table ORDER BY created_at ASC
Store this into an array. Do this for each of the data sources, and then perform a merge sort on all the arrays in Ruby. Of course this will work well for small data sets, but once you get a data set that is large (i.e. larger than will fit into memory) you will have to use a different collect/merge algorithm.
So I guess the answer is that you do need to do some sorting in Ruby, unless you resort to the UNION method described in another answer.

Depending on whether these databases are all on the same machine or not:
On the same machine: use ORDER BY and UNION statements in your SQL to return your result set.
On different machines: you'll want to test this for performance, but you could use linked servers with UNION and ORDER BY. Alternatively, you could have Ruby get the results from each DB and then combine and sort them.
EDIT: From your last comment about different tables rather than different DBs, use something like this:
SELECT created_at FROM table1
UNION
SELECT created_at FROM table2
ORDER BY created_at
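If you also want to know which table each row came from and pull back more than the timestamp, a variant along these lines may help (the table names posts and comments are just examples; each SELECT must return the same column list). Note that UNION ALL keeps duplicate rows, which is usually what you want for a history feed:
SELECT id, created_at, 'post' AS source FROM posts
UNION ALL
SELECT id, created_at, 'comment' AS source FROM comments
ORDER BY created_at DESC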

Related

How to create staging table to handle incremental load

We are designing a Staging layer to handle incremental load. I want to start with a simple scenario to design the staging.
In the source database there are two tables, e.g. tbl_Department and tbl_Employee. Both of these tables load into a single table in the destination database, e.g. tbl_EmployeRecord.
The query which loads tbl_EmployeRecord is:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
Now, we need to identify the incremental load in tbl_Department and tbl_Employee, store it in staging, and load only the incremental data to the destination.
The columns of the tables are,
tbl_Department : DEPARTMENTID,DEPTNAME
tbl_Employee : EMPID,EMPNAME,DEPARTMENTID
tbl_EmployeRecord : EMPID,EMPNAME,DEPTNAME
Kindly suggest how to design the staging for this to handle Insert, Update and Delete.
Identifying Incremental Data
The incremental loading needs to be based on some segregating information present in your source table. Such information helps you identify the incremental portion of the data that you will load. Often, the load date or last-updated date of the record is a good choice for this.
Consider this: your source table has a date column that stores both the date of insertion of a record and the date when any update was done on that record. On any given day during your staging load, you can take advantage of this date to identify which records were newly inserted or updated since your last staging load, and you consider only those changed/updated records as your incremental delta.
Given your table structures, I am not sure which column you could use for this. ID columns will not help, because if a record gets updated you won't be able to detect it.
Maintaining Load History
It is important to store information about how much you have loaded today so that you can load the next part in the next load. To do this, maintain a staging table - often called a Batch Load Details table. That table typically has a structure like the one below:
BATCH ID | START DATE | END DATE | LOAD DATE | STATUS
------------------------------------------------------
1 | 01-Jan-14 | 02-Jan-14 | 02-Jan-14 | Success
You need to insert a new record in this table every day before you start the data loading. The new record will have a start date equal to the end date of the last successful load and a null status. Once loading is successful, you update the status to 'Success'.
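In generic SQL, that daily bookkeeping could look roughly like this (adjust the date functions to your engine; table and column names follow the example above):
-- before the load starts: open a new batch
INSERT INTO BATCH_LOAD (BATCH_ID, START_DATE, END_DATE, LOAD_DATE, STATUS)
SELECT MAX(BATCH_ID) + 1, MAX(END_DATE), CURRENT_DATE, CURRENT_DATE, NULL
FROM BATCH_LOAD
WHERE STATUS = 'Success';
-- after the load finishes successfully: close the batch
UPDATE BATCH_LOAD SET STATUS = 'Success' WHERE STATUS IS NULL;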
Modifying the Data Extraction Query to Take Advantage of the Batch Load Table
Once you maintain your loading history like above, you can include this table in your extraction query:
SELECT EMPID,EMPNAME,DEPTNAME
FROM tbl_Department D
INNER JOIN tbl_Employee E
ON D.DEPARTMENTID=E.DEPARTMENTID
WHERE E.load_date >= (SELECT max(START_DATE) FROM BATCH_LOAD WHERE status IS NULL)
What I am going to suggest is by no means a standard. In fact, you should evaluate my suggestion carefully against your requirements.
Suggestion
Use incremental loading for transaction data, not for master data. Transaction data are generally higher in volume and can easily be segregated into incremental chunks. Master data tend to be more manageable and can be loaded in full every time. In the above example, I am assuming your Employee table behaves like transactional data, whereas your Department table is your master data.
I trust this article on incremental loading will be very helpful for you.
I'm not sure what database you are using, so I'll just talk in conceptual terms. If you want to add tags for specific technologies, we can probably provide specific advice.
It looks like you have 1 row per employee and that you are only keeping the current record for each employee. I'm going to assume that EMPIDs are unique.
First, add a field to the query that currently populates the dimension. This field will be a hash of the other fields in the table (EMPID, EMPNAME, DEPTNAME). You can create a view, populate a new staging table, or just use the query. Also add this same hash field to the dimension table. Basically, the hash is an easy way to generate a field that is unique for each record and efficient to compare.
Inserts: These are the records for which the EMPID does not already exist in the dimension table but does exist in your staging query/view.
Updates: These are the records for which the EMPID exists in both the staging query/view and the dimension table, but the hash fields don't match.
Deletes: These are the records for which the EMPID exists in the dimension but does not exist in the staging query/view.
If this will be high-volume, you may want to create new tables to hold the records that should be inserted and the records that should be updated. Once you have identified the records, you can insert/update them all at once instead of one-by-one.
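A rough set-based sketch of the three cases (SQL Server syntax is assumed; the staging source stg_EmployeRecord and the RowHash column are illustrative names, where RowHash could be computed with something like HASHBYTES('SHA1', CONCAT(EMPID, '|', EMPNAME, '|', DEPTNAME))):
-- Inserts: EMPID present in staging but not in the dimension
INSERT INTO tbl_EmployeRecord (EMPID, EMPNAME, DEPTNAME, RowHash)
SELECT s.EMPID, s.EMPNAME, s.DEPTNAME, s.RowHash
FROM stg_EmployeRecord s
WHERE NOT EXISTS (SELECT 1 FROM tbl_EmployeRecord t WHERE t.EMPID = s.EMPID);
-- Updates: EMPID present in both, but the hashes differ
UPDATE t
SET t.EMPNAME = s.EMPNAME, t.DEPTNAME = s.DEPTNAME, t.RowHash = s.RowHash
FROM tbl_EmployeRecord t
JOIN stg_EmployeRecord s ON s.EMPID = t.EMPID
WHERE t.RowHash <> s.RowHash;
-- Deletes: EMPID in the dimension but no longer in staging (see the note on deletes below)
DELETE FROM tbl_EmployeRecord
WHERE EMPID NOT IN (SELECT EMPID FROM stg_EmployeRecord);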
It's a bit uncommon to delete lots of records from a data warehouse, as warehouses are typically used to keep history. I would suggest perhaps creating a status column or a bit field that indicates whether the record is active or deleted in the source. Of course, how you handle deletes should depend upon your business needs/reporting requirements. Just remember that if you do a hard delete you can never get that data back if you decide you need it later.
Updating the existing dimension in place (rather than creating historical records for each change) is called a Type 1 dimension in dimensional modeling terms. This is fairly common. But if you decide you need to keep history, you can use the hash to help you create the SCD Type 2 records.

Optimizing JOINs : comparison with indexed tables

Let's say we have a time-consuming query like the one described below:
SELECT ...
FROM (SELECT ...
      FROM ...) AS FOO
LEFT JOIN (SELECT ...
           FROM ...) AS BAR
ON FOO.BarID = BAR.ID
Let's suppose that
(SELECT ...
FROM ...) AS FOO
returns many rows (let's say 10 M). Every single row has to be joined with data in BAR.
Now let's say we insert the result of
(SELECT ...
FROM ...) AS BAR
into a table, and add the ad hoc index(es) to it.
My question:
How would the performance of the "JOIN" with a live query differ from the performance of the "JOIN" to a table containing the result of the previous live query, to which ad hoc indexes have been added?
Another way to put it:
If a JOIN is slow, would there be any gain in actually storing and indexing the table to which we JOIN?
The answer is 'Maybe'.
It depends on the statistics of the data in question. The only way you'll find out for sure is to actually load the first query into a temp table, stick a relevant index on it, then run the second part of the query.
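For instance, in T-SQL that experiment might look something like this (the table and column names are stand-ins for your actual subqueries, not anything from the question):
-- materialise the first subquery (FOO) into a temp table
SELECT s.ID, s.BarID, s.SomeColumn
INTO #foo
FROM dbo.SomeSourceTable AS s;
-- index the join key
CREATE INDEX IX_foo_BarID ON #foo (BarID);
-- then join against it as before
SELECT f.ID, f.SomeColumn, b.OtherColumn
FROM #foo AS f
LEFT JOIN dbo.Bar AS b ON f.BarID = b.ID;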
I can tell you that if speed is what you want, and it's possible for you to load the results of your first query permanently into a table, then of course your query is going to be quicker.
If you want it to be even faster, depending on which DBMS you are using you could consider creating an index which crosses both tables - if you're using SQL Server these are called 'indexed views', and most other systems have an equivalent in materialized views.
Finally, if you want the ultimate in speed, consider denormalising your data and eliminating the join that occurs on the fly - basically you move the pre-processing (the join) offline, at the cost of storage space and data consistency (your live table will be a little behind, depending on how frequently you run your updates).
I hope this helps.

Avoiding round-trips when importing data from Excel

I'm using EF 4.1 (Code First). I need to add/update products in a database based on data from an Excel file. As discussed here, one way to achieve this is to use dbContext.Products.ToList() to force loading all products from the database, then use db.Products.Local.FirstOrDefault(...) to check whether a product from Excel exists in the database, and proceed accordingly with an add or update. This is only one round-trip.
Now, my problem is there are too many products in the database, so it's not possible to load all products into memory. What's the way to achieve this without multiplying round-trips to the database? My understanding is that if I just do a search with db.Products.FirstOrDefault(...) for each Excel product to process, this will perform a round-trip each time, even if I issue the statement for the exact same product several times! What's the purpose of EF caching objects and returning the cached value if it goes to the database anyway?
There is actually no way to make this better. EF is not a good solution for this kind of task. You must know whether a product already exists in the database to use the correct operation, so you always need to do an additional query - you can group multiple products into a single query using .Contains (like SQL IN), but that only solves the existence check. The worse problem is that each INSERT or UPDATE is executed in a separate round-trip as well, and there is no way around this because EF doesn't support command batching.
Create a stored procedure and pass information about the product to it. The stored procedure will perform an insert or update based on the existence of the record in the database.
You can even use some more advanced features like table-valued parameters to pass multiple records from Excel into the procedure with a single call, or import the Excel file into a temporary table (for example with SSIS) and process them all directly on the SQL server. Lastly, you can use bulk insert to get all records into a special import table and again process them with a single stored procedure call.
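As a hedged sketch only (T-SQL, with a made-up Products table keyed by a Code column), such a procedure could look like:
CREATE PROCEDURE dbo.UpsertProduct
    @Code  NVARCHAR(50),
    @Name  NVARCHAR(200),
    @Price DECIMAL(18, 2)
AS
BEGIN
    -- update the row if it already exists, otherwise insert it
    IF EXISTS (SELECT 1 FROM dbo.Products WHERE Code = @Code)
        UPDATE dbo.Products
        SET Name = @Name, Price = @Price
        WHERE Code = @Code;
    ELSE
        INSERT INTO dbo.Products (Code, Name, Price)
        VALUES (@Code, @Name, @Price);
END;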

Is a full list returned first and then filtered when using linq to sql to filter data from a database or just the filtered list?

This is probably a very simple question that I am working through in an MVC project. Here's an example of what I am talking about.
I have a dbml file linked to a database with a table called Users that has 500,000 rows. But I only want to find the Users who were entered on 5/7/2010. So let's say I do this in my UserRepository:
from u in db.GetUsers() where u.CreatedDate == new DateTime(2010, 5, 7) select u
(doing this from memory so don't kill me if my syntax is a little off, it's the concept I am looking for)
Does this statement first return all 500,000 rows and then filter it or does it only bring back the filtered list?
It filters in the database, since you're building your expression on top of an ITable, which gives you an IQueryable<T> data source.
Linq to SQL translates your query into SQL before sending it to the database, so only the filtered list is returned.
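For a filter like the one in the question, the SQL sent to the server is roughly of this shape (the column list and parameter name here are illustrative, not the exact output):
SELECT [t0].[UserID], [t0].[UserName], [t0].[CreatedDate]
FROM [dbo].[Users] AS [t0]
WHERE [t0].[CreatedDate] = @p0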
When the query is executed it will create SQL to return the filtered set only.
One thing to be aware of is that if you do nothing with the results of that query nothing will be queried at all.
The query will be deferred until you enumerate the result set.
These folks are right, and one recommendation I would have is to monitor the queries that LinqToSql is creating. LinqToSql is a great tool, but it's not perfect. I've noticed a number of little inefficiencies by monitoring the queries that it creates and tweaking them a bit where needed.
The DataContext has a "Log" property that you can work with to view the queries created. I created a simple HttpModule that outputs the DataContext's Log (formatted for sweetness) to my output window. That way I can see the SQL it used and adjust if need be. It's been worth its weight in gold.
Side note - I don't mean to be negative about the SQL that LinqToSql creates, as it's very good and efficient almost every time. Another good side effect of monitoring the queries is that you can show your friends who are die-hard ADO.NET/stored-proc people how efficient LinqToSql really is.

How filter a dataset based on a nested dataset record count?

I have a DataSet and I want to apply a filter based on a dataset-type field's record count, something like: 'NESTED_DATASET_FIELD.RecordCount > 0'.
If the dataset comes from a SQL-based storage engine, use a SELECT DISTINCT query on the joined tables with only fields from the master table in the result set. Let the SQL engine do the work for you.
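For example, something along these lines, where the master/detail table and column names are purely hypothetical:
SELECT DISTINCT m.id, m.name
FROM master_table m
INNER JOIN detail_table d ON d.master_id = m.id;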
Depending on your situation, you can use:
1. In the OnFilterRecord event you can have:
   Accept := myDataSetField.NestedDataSet.RecordCount > 0;
2. If you have a SQL backend you can use EXISTS or COUNT in order to fetch only the records which you need (see the sketch after this list). Perhaps this is the best approach if you are over a network. However, I don't know what infrastructure you have.
3. In the OnFilterRecord event you can have:
   Accept := not myDataSetField.IsNull;
   //Just testing if the dataset field is empty - which is one of the fastest ways to do it
   ...but this depends on the structure of your data / dataset etc.
4. Sometimes it is better to have a dedicated field in your DataSet / Table to specify this status, because usually getting such info from the nested dataset can be expensive. (One must fetch it at least partially etc.)
5. Also, for the same considerations (see 4. above), perhaps you can have a Stored Procedure (if your DB backend permits) to get this info.
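As mentioned in option 2, a SQL backend can do this filtering for you. A minimal sketch, again with hypothetical master/detail tables:
SELECT m.*
FROM master_table m
WHERE EXISTS (SELECT 1 FROM detail_table d WHERE d.master_id = m.id);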
HTH
