Best way to cleanse huge amounts of data - data-warehouse

I have a huge table in my Oracle database - around 40 million rows - and I have to check that the data in every field is correct according to some business rules.
Let us assume this is my huge table, and the rules are:
Only US can have [national_id_type] = NID; others must be VISA
Only US with NA can own a house; other nationalities with VISA can't own a house
What I think
First, land the data in a landing table to preserve the original values
Then rewrite the bad values in each column with an error code, so the errors and their counts can be shown in a Power BI dashboard for the quality department to fix
What I will do is
CASE WHEN nationality <> 'US' AND national_id_type = 'VISA'
     THEN '99\US only can have [national_id_type]=NID'
     ELSE national_id_type END AS national_id_type,
CASE WHEN nationality <> 'US' AND national_id_type <> 'NA' AND own_house = 'YES'
     THEN '99\US only with NA can own house'
     ELSE own_house END AS own_house
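Put together, I imagine the full statement would look roughly like this (only a sketch - my_table_landing and my_table_cleansed are placeholder names, and the other ~68 columns are left out):
-- Sketch only: one full-table pass from the landing table into a flagged copy.
-- my_table_landing / my_table_cleansed are placeholder names.
CREATE TABLE my_table_cleansed AS
SELECT
    CASE WHEN nationality <> 'US' AND national_id_type = 'VISA'
         THEN '99\US only can have [national_id_type]=NID'
         ELSE national_id_type END AS national_id_type,
    CASE WHEN nationality <> 'US' AND national_id_type <> 'NA' AND own_house = 'YES'
         THEN '99\US only with NA can own house'
         ELSE own_house END AS own_house
    -- ...one CASE per column that has a rule, the rest copied as-is
FROM my_table_landing;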
Is this an efficient way to detect errors across 40 million records x 70 columns?
I think this will be very slow...
Can you give me some tips on how to provide the errors for dashboarding and cleansing in a way that performs well?
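For the dashboard itself, I imagine producing the per-rule error counts in a single aggregate pass, roughly like this (again just a sketch over the placeholder table, relying on the '99\' prefix that marks an error):
-- Sketch only: count rows per violated rule for the Power BI dashboard.
SELECT 'US only can have NID' AS rule_violated, COUNT(*) AS error_count
FROM my_table_cleansed
WHERE national_id_type LIKE '99\%'
UNION ALL
SELECT 'US only with NA can own house', COUNT(*)
FROM my_table_cleansed
WHERE own_house LIKE '99\%';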

Related

How to count cypher labels with specific condition?

I have a graph database with information about different companies and their subsidiaries. My task is to display the structure of a company, which I have achieved with d3 and a vertical tree.
Additionally, I have to write summary statistics about the company that is currently displayed. Companies can be chosen from a dropdown list, which fetches this data dynamically via an AJAX call.
In the same HTML I have to write a short summary like:
Total amount of subsidiaries for CompanyA: 300
Companies in Corporate Havens: 45%
Companies in Tax Havens: 5%
My database consists of two node types, Company and Country, and the country nodes carry labels like CH and TH.
CREATE (:TH:Country{name:'Nauru', capital:'Yaren', lng:166.920867,lat:-0.5477})
WITH 1 as dummy MATCH (a:Company), (b:Country) WHERE a.name='CompanyA' AND b.name='Netherlands' CREATE (a)-[:IS_REGISTERED]->(b)
So how can I find the number of subsidiaries of CompanyA that are registered in corporate and tax havens? And how do I pass this info on to the HTML?
I found different Cypher queries to query all the labels, as well as the apoc stats procedures, but these do not allow me to filter on the mother company. I appreciate any help.
Cypher is nice here because you write a query almost in natural language (the query below may not be exactly right - I did not check it - but the idea is clear):
MATCH (motherCompany:Company {name: 'CompanyA'})-[:HAS_SUBSIDIARY]->(childCompany:Company)
MATCH (childCompany)-[:IS_REGISTERED]->(country:Country)
WITH motherCompany,
     collect(labels(country)) AS countriesLabels
WITH motherCompany,
     countriesLabels,
     size([countryLabels IN countriesLabels WHERE 'TH' IN countryLabels]) AS inTaxHaven
RETURN motherCompany,
       size(countriesLabels) AS total,
       inTaxHaven,
       size(countriesLabels) - inTaxHaven AS inCorporateHaven

Data Warehouse Design of Fact Tables

I'm pretty new to data warehouse design and am struggling with how to design the fact tables given very similar but somewhat different metrics. Let's say you were evaluating the metrics below (in this case, company is a subset of client): how would you break up the fact tables? Could you use one table for all of this, would each metric warrant its own fact table, or would each part of each metric be its own column in one fact table?
Total company daily/monthly/yearly # of files processed
Total company daily/monthly/yearly file sizes processed
Total company daily/monthly/yearly # files errored
Total company daily/monthly/yearly # files failed
Total client daily/monthly/yearly # of files processed
Total client daily/monthly/yearly file sizes processed
Total client daily/monthly/yearly # files errored
Total client daily/monthly/yearly # files failed
By the looks of the measure names, I think you'll be well served by a single fact table with a record for each file and a link back to a date_dim:
create table date_dim (
    date_sk int,
    calendar_date date,
    month_ordinal int,
    month_name nvarchar(20),
    year int
    -- ...etc., you've got your own one of these
)
create table fact_file_measures (
    date_sk int,
    file_sk int,      -- ref the file_dim with additional file info
    company_sk int,   -- ref the company_dim with the client details
    processed int,    -- should always be one; optional, depending on how your reporting team likes to work
    size_Kb decimal,  -- specify a size measurement, ambiguity is bad
    error_count int,  -- 1 if the file had an error, 0 if fine
    failed_count int  -- 1 if the file failed, 0 if fine
)
So now you should be able to construct queries to get everything you asked for.
For example, for your monthly stats:
select
    c.company_name,
    c.client_name,
    sum(f.processed)    as total_files,
    sum(f.size_Kb)      as total_file_size_Kb,
    sum(f.error_count)  as total_files_errored,
    sum(f.failed_count) as total_files_failed
from fact_file_measures f
    inner join company_dim c on f.company_sk = c.company_sk
    inner join date_dim d on f.date_sk = d.date_sk
where d.month_name = 'January' and d.year = 1984
group by c.company_name, c.client_name
If you need the side-by-side day/month/year stuff, you can construct year and month fact tables to do the roll-ups and join back via date_dim's month/year fields; a rough sketch of such a roll-up follows. (You could include month and year fields in the daily fact table, but those values may end up being misused by less experienced report builders.) It all comes back to what your users actually want - design your fact tables to their requirements and don't be afraid to have separate fact tables. A data warehouse is not about normalization, it's about presenting the data in a way that it can be used.
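For instance, a monthly roll-up could be materialised from the daily fact along these lines (just a sketch in SQL Server style; the roll-up table name is made up):
-- Sketch only: monthly roll-up built from the daily fact table above.
select
    d.year,
    d.month_ordinal,
    f.company_sk,
    sum(f.processed)    as total_files,
    sum(f.size_Kb)      as total_file_size_Kb,
    sum(f.error_count)  as total_files_errored,
    sum(f.failed_count) as total_files_failed
into fact_file_measures_monthly
from fact_file_measures f
    inner join date_dim d on f.date_sk = d.date_sk
group by d.year, d.month_ordinal, f.company_sk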
Good luck

Best Data Access approach for asp.net mvc application with big data

We have a big challenge: the application is too slow because of the huge number of records :(
We are looking for the best data access pattern for a web application (ASP.NET MVC 3) with big data (a SQL Server view with more than 66 million records).
Customers have forms with search fields and see the results in a grid on the web page.
So far we have used both of these data access approaches:
1- Entity Framework with some features like compiled queries
// get result
GridModelsResult = VwLoanInquiryList.Skip(pageIndex * pageSize).Take(pageSize).ToList();
// get count (totalRecords would come from something like VwLoanInquiryList.Count())
int totalPages = (int)Math.Ceiling((float)totalRecords / (float)pageSize);
2- A SQL Server stored procedure for pagination that returns only the current page of results
-- get result
select t1.* into #t
from (
    select *
    from VwLoanInquiry
    where CustomerNumber like '%' + @CustomerNumber + '%'
) as [t1]
where (@PageIndex = -1
       or (t1.ID between (@PageIndex - 1) * @RecordPerPage and (@PageIndex * @RecordPerPage) - 1))
-- get count
select count(*) from #t
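For reference, on SQL Server 2012 or later the same page plus the total count can be fetched in a single pass without the temp table; this is only a sketch and assumes @PageIndex is 1-based:
-- Sketch only: single-pass paging with OFFSET/FETCH plus a windowed total count.
select v.*, count(*) over () as TotalRecords
from VwLoanInquiry v
where v.CustomerNumber like '%' + @CustomerNumber + '%'
order by v.ID
offset (@PageIndex - 1) * @RecordPerPage rows
fetch next @RecordPerPage rows only;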
3- OLAP: we made a cube in Analysis Services (SSAS), but here our problem is showing string-typed columns within the cube cells (it's impossible to define non-numeric fields as measures).
In addition, to get results faster, our view is indexed and the column data types have been set appropriately.
Please help us work out which of these approaches is better, or if there is another way, what it is.
Thanks

Sybase compare columns with duplicate row ids

So far I have a query with a result set (in a temp table) with several columns, but I am only concerned with four: a customer ID (varchar), a Date (smalldatetime), an Amount (money) and a Type (char). I have multiple rows with the same customer ID and want to evaluate them based on Date, Amount and Type. For example:
Customer ID   Date      Amount   Type
A             1-1-10    200      blue
A             1-1-10    400      green
A             1-2-10    400      green
B             1-11-10   100      blue
B             1-11-10   100      red
For all occurrences of A I want to compare them to identify only one: first by earliest date, then by greatest Amount, then, if still tied, by comparing Types. I would then return one row for each customer.
I would provide some of the query but I am at home now after spending two days trying to get a correct result. It looks something like this:
(query to populate #tempTable)
GROUP BY customer_id
HAVING date_cd =
(SELECT MIN(date_cd)
FROM order_table ot
WHERE ot.customerID = #tempTable.customerID
)
OR date_cd IS NULL
I assumed the HAVING would result in only one row per customer_id. This did not end up being the case, since there were some ties.
I am not sure I can do the OR - there are some NULL values here - and it does not account for moving on to the next comparison if the dates are all the same anyway. I am not seeing a way to avoid doing some row-by-row processing of the temp table with some kind of IF or WHILE loop.
As I write this I am thinking maybe I should use #tempTable.date_cd in the HAVING clause instead of looking at the original table, but that should return the same dates?
Am I on the right track or is there something missing? Suggestions? More info?
Try the query below:
select * from #tempTable
group by customer_id
having isnull(date_cd, '1900/01/01') = min(isnull(date_cd, '1900/01/01'))
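If ties remain after the date comparison, the Amount and Type tie-breaks can be expressed with a correlated NOT EXISTS instead. This is only a sketch: amount and type_cd stand in for whatever the real column names in #tempTable are, and NULL dates would need the same isnull() treatment as above:
-- Sketch only: one row per customer - earliest date, then greatest amount, then lowest type.
select t.*
from #tempTable t
where not exists (
    select 1
    from #tempTable t2
    where t2.customer_id = t.customer_id
      and (   t2.date_cd < t.date_cd
           or (t2.date_cd = t.date_cd and t2.amount > t.amount)
           or (t2.date_cd = t.date_cd and t2.amount = t.amount and t2.type_cd < t.type_cd))
)
-- Rows that tie on all three columns would still both come back and need one more tie-break (e.g. a row id).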

DELPHI - help need with a ClientDataSet

I have a ClientDataSet with the following data
IDX   EVENT   BRANCH_ID   BRANCH
1     E1      7           B7
2     E2      5           B5
3     E3      7           B7
4     E4      1           B1
5     E5      2           B2
6     E6      7           B7
7     E7      1           B1
I need to transform this data into
IDX   EVENT   BRANCH_ID   BRANCH
1     E1      7           B7
2     E2      5           B5
4     E4      1           B1
5     E5      2           B2
The only fields of importance are the BRANCH_ID and BRANCH
and BRANCH_ID must be unique
As there is a lot of data, I do not want to have two copies of it.
QUESTION:
Can you suggest a way to transform the data using a cloned version of the original data?
Cloning won't allow you to change data in a clone without having the same change reflected in the original, so if that's what you want, you might rethink the cloning idea.
Cloning does give you a separate cursor into the clone and allows you to filter and index (i.e. order) it independently of the master clientdataset. From the data you've provided it looks like you want to filter some branch data and order by branch_id. You can accomplish that by setting up a new filter and index on the clone. Here's a good article that includes examples of how to do that:
http://edn.embarcadero.com/article/29416
Taking a second look at your question, it seems like all you'd need to do is set up a unique index on branch_id on the cloned dataset. The linked article above has info on how to set up an index; check the docs on the ClientDataSet.AddIndex function for more details and for info on setting the index to show only unique values - if I recall correctly, it may just mean you set branch_id as the primary key.
I can't think of a slick way to do this, but you could index on BRANCH_ID, add an fkInternalCalc boolean field to your dataset, then initialize that field to True on the first row of each branch (using group state or manually) and then filter the clone on the value of the field. You'd have to update the field on data changes though.
I have a feeling that a better solution would be to have a master dataset with a row for each branch.
You don't provide many details about your use case so I'll try to give you some hints:
"A lot of data" suggests that you might have it from a SQL backend. Using a 'SELECT DISTINCT...' or 'SELECT ... GROUP BY BRANCH_ID' (or similar syntax depending on what SQL backend do you) will give the desired result with ease and speed. Please confirm and I'll give you more details.
As the others said, a simple 'clone' won't work. The simplest (and perhaps quickest) solution, assuming that the branches are usually few in number relative to the data, is to keep an index outside of your dataset. If you really want to filter your original data, then add a status field (e.g. a boolean) to your data and put a flag (e.g. True) on the first occurrence.
PseudoCode:
(Let's assume that:
  your ClientDataSet is cds1;
  cds1 has a Boolean status field cds1Status - optional, needed only if you want to sort/filter/search cds1;
  you have lIndex, which is a TStringList;
  cVal is a local string and nDummy a local Integer)
lIndex.Clear;
lIndex.Sorted := True;
with cds1 do
try
  DisableControls;
  First;
  while not Eof do                     // scan the dataset
  begin
    cVal := cds1Branch_ID.AsString;
    Edit;                              // we update the status field either way
    if lIndex.Find(cVal, nDummy) then  // nDummy - we don't use it
      cds1Status.AsBoolean := False    // already in the index: not the 1st occurrence
    else
    begin                              // not found - well, let's add it...
      lIndex.Add(cVal);                // update the index
      cds1Status.AsBoolean := True;    // mark the first occurrence
    end;
    Post;                              // save the change to the status field
    Next;
  end; // scan
finally
  EnableControls;                      // housekeeping
end;
// WARNING: NOT tested - I wrote it from my head, but I think you get the idea...
...Depending on what you are trying to accomplish (which would be the best thing to share with us) and what selectivity you have on BRANCH_ID, perhaps the status-field machinery isn't needed at all. If you have a very low selectivity on that field (selectivity = no. of unique values / no. of records), it is probably much faster to create a new dataset and copy only the unique values into it, rather than putting each record of the original cds through Edit + Post. (Changing dataset states is a costly operation, especially if your cds is linked to a remote data store, i.e. a server.)
hth,
PS: My solution is intended to be mostly simple. You can also test with lIndex.Sorted := False and use lIndex.IndexOf instead of Find; in some (rare) cases that is better - it depends on your data. If you want to complicate things and speed is really a concern, you can implement a full-blown B-tree index for your searches (libraries are available). You could also use the CDS index engine, index BRANCH_ID and do many Locate calls on a clone, but because your selectivity is clearly < 1, scanning the cds's entire index should theoretically be slower than a scan over a unique index, especially if your custom-made index is tailored to your data type, structure, distribution, etc.
just my2c
