Identify duplicate records in multiple databases?

I am working on Election Department databases in India. I have been asked to find the duplicate records of one database with respect to the other databases of a state, based on the elector's name, guardian's name, and age. A state is divided into assembly constituencies, and each assembly constituency into polling booths. My state therefore has 68 databases, one per constituency, named AC_001, AC_002, and so on up to AC_068. Each database contains as many tables as there are polling booths in that constituency; in the first database, AC_001, the tables are named AC001PART001, AC001PART002, and so on.
Each table roughly contains the following fields:
ccode (auto-increment field)
name of elector
Relation_type (Father or husband)
Relation_name (name of guardian)
Assembly constituency no
Polling booth no in assembly
serial no (unique no given to an elector in a polling booth)
age
image of elector
Now I want a query that can find the duplicate records of one database with respect to another, based on name, relation name, and age. I also want the number of times each record is repeated. Finally, I want a list containing:
Elector name
Relation_Type
Relation_name
Assembly constituency no
Polling booth no
Serial no
age
no of times the record is repeated across both databases
image of elector
I have already written a query, but it takes very long to return results. Please suggest an outline of a query that can generate the required records quickly.
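One possible outline, sketched below, is to copy the comparison columns from every table into a single indexed staging table and then run one GROUP BY over it, rather than joining the per-booth tables pairwise. This is only a rough sketch: it assumes a MySQL-style server where all the AC_* databases are reachable as schemas, and the column names (elector_name, relation_type, relation_name, ac_no, booth_no, serial_no, age) are placeholders for whatever the real tables use.

-- Staging table holding only the columns needed for the duplicate check.
CREATE TABLE stage_electors (
    elector_name  VARCHAR(100),
    relation_type VARCHAR(20),
    relation_name VARCHAR(100),
    ac_no         INT,
    booth_no      INT,
    serial_no     INT,
    age           INT,
    INDEX idx_dup (elector_name, relation_name, age)   -- supports the duplicate lookup
);

-- One INSERT per polling-booth table of each database being compared
-- (in practice these statements would be generated by a script).
INSERT INTO stage_electors
SELECT elector_name, relation_type, relation_name, ac_no, booth_no, serial_no, age
FROM AC_001.AC001PART001;

INSERT INTO stage_electors
SELECT elector_name, relation_type, relation_name, ac_no, booth_no, serial_no, age
FROM AC_002.AC002PART001;
-- ... and so on for the remaining tables.

-- Every elector whose (name, guardian name, age) combination occurs more than once,
-- together with the number of occurrences.
SELECT s.elector_name,
       s.relation_type,
       s.relation_name,
       s.ac_no,
       s.booth_no,
       s.serial_no,
       s.age,
       d.times_repeated
FROM stage_electors AS s
JOIN (
    SELECT elector_name, relation_name, age, COUNT(*) AS times_repeated
    FROM stage_electors
    GROUP BY elector_name, relation_name, age
    HAVING COUNT(*) > 1
) AS d
  ON  d.elector_name  = s.elector_name
  AND d.relation_name = s.relation_name
  AND d.age           = s.age
ORDER BY s.elector_name, s.relation_name, s.age;

The composite index on (elector_name, relation_name, age) is what keeps the GROUP BY and the join fast; comparing the original tables pairwise without such an index is usually what makes this kind of query slow. The elector's image can be fetched afterwards by joining back to the source table on constituency, booth, and serial no (or ccode), so the large image column stays out of the duplicate search.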

Related

What's the use of a dimension that is a subset of a fact table?

In the book 'Hands-On SQL Server 2019 Analysis Services', the author presents this model.
In the center I see Sales and InvoiceSales as fact tables. My question is about the Invoice dimension: it has only 2 columns, which are already present in InvoiceSales, so why did he add it?
note: the InvoiceSales fact table has the InvoiceDateKey column.
This may be a business need, such as a snapshot fact table:
Snapshot Fact Tables
(Periodic) Snapshot fact tables capture the state of the measures based on the occurrence of a status event, or at a specified point in time, or over specified time intervals (week, month, quarter, year, etc.).
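As a rough illustration of the idea (hypothetical table and column names, not taken from the book), a periodic snapshot fact for invoices might capture each invoice's state at every month end:

-- One row per invoice per month-end snapshot; no transaction-level detail.
CREATE TABLE FactInvoiceMonthlySnapshot (
    SnapshotDateKey   INT NOT NULL,        -- month-end date from the date dimension
    InvoiceKey        INT NOT NULL,        -- reference to the Invoice dimension
    OutstandingAmount DECIMAL(18, 2),
    PaidToDateAmount  DECIMAL(18, 2),
    PRIMARY KEY (SnapshotDateKey, InvoiceKey)
);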

How to count cases with the same ID but different variables in SPSS

I have a data set which has 4420 attendances to a medical department from 1120 people. Each person has a unique ID number, and the other columns are demographics and primary care provider. I want to filter the data so I can work out how many times each person attends the department, and then analyse the data by demographics, e.g. primary care provider or age. The data show whether each attendance is primary or duplicate, but I can't figure out how to work out attendances per person.
If what you want is to count the number of times each person has visited (assuming each visit is represented by a single row in the data), use the AGGREGATE command, breaking on the ID variable, to add the number of instances to the file as a new variable. In the menus: Data > Aggregate, move the ID variable into the box for Break Variable(s), check the box for Number of cases under Aggregated Variables, change the default name N_BREAK to another name if you want, and click OK. That will add a new variable to the data with the number of instances for each unique ID.

Identifying the fact table in data warehouse design

I'm trying to design my first data mart with a star schema from an Excel sheet containing information about Help Desk service calls. The sheet contains 33 fields holding different kinds of information, and I can't identify the fact table because I want to do the reporting later based on different KPIs.
I want to know how to identify the fact table measures easily, and I have another question: can a fact table contain only foreign keys of dimensions and no measures? Thanks in advance, and sorry for my bad English.
You can have more than one fact table.
A fact table represents an event or process that you want to analyze.
The structure of the fact tables depends on the process or event that you are trying to analyze.
You need to tell us the events or processes that you want to analyze before we can help you further.
Can a fact table contain only foreign keys of dimensions and no measures?
Yes. This is called a factless fact table.
Let's say you want to do a basic analysis of calls:
Your full table might look like this:
CALL_ID
START_DATE
DURATION
AGENT_NAME
AGENT_TENURE (how long worked for company)
CUSTOMER_NAME
CUSTOMER_TENURE (how long a customer)
PRODUCT_NAME (the product the customer is calling about)
RESOLVED
You would turn this into a fact table like this:
CALL_ID
START_DATE_KEY
AGENT_KEY
CUSTOMER_KEY
PRODUCT_KEY
DURATION (measure)
RESOLVED (quasi-measure)
And you would have a DATE dimension table, AGENT dimension table, CUSTOMER dimension table and PRODUCT dimension table.
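As a rough illustration of that structure (hypothetical names and types, not taken from the answer above), the fact table and one of its dimensions might be declared like this; the other dimensions would follow the same pattern as DIM_AGENT:

-- Dimension: descriptive attributes live here, not in the fact table.
CREATE TABLE DIM_AGENT (
    AGENT_KEY    INT NOT NULL PRIMARY KEY,
    AGENT_NAME   VARCHAR(100),
    AGENT_TENURE INT                        -- how long the agent has worked for the company
);

-- Fact: one row per call, surrogate keys plus the measures.
CREATE TABLE FACT_CALL (
    CALL_ID        INT NOT NULL PRIMARY KEY,
    START_DATE_KEY INT NOT NULL,            -- FK to DIM_DATE
    AGENT_KEY      INT NOT NULL REFERENCES DIM_AGENT (AGENT_KEY),
    CUSTOMER_KEY   INT NOT NULL,            -- FK to DIM_CUSTOMER
    PRODUCT_KEY    INT NOT NULL,            -- FK to DIM_PRODUCT
    DURATION       INT,                     -- measure, e.g. call length in seconds
    RESOLVED       BIT                      -- quasi-measure: 1 if the call was resolved
);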
Agile Data Warehouse Design is a good book, as are the ones by Kimball.
In general, the way I've done it (and there are a number of ways to do anything) is that the categorical data is referenced with a foreign key in the fact table, but anything you want to perform aggregations on (typically money, integer, or double columns) can be in the fact table as well. For example, a fact table might contain a hierarchy of types, such as product_category >> product_name, and it usually contains a time and/or location field as well; all of these would be referenced by a foreign key to a lookup table. The measure columns are usually integer- or money-based data, and are used in aggregate functions grouped by the other fields, like this:
select product_category, sum(measureOne) as total_measure
from facttable
where timeCol between X and Y
group by product_category
-- ...etc
At one time a few years ago, I did have a fact table that had no measure column... because the only measure I had was based on count, which I computed dynamically by grouping on different dimensions in the fact table.
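For instance (hypothetical names), a count-only analysis over such a factless fact table might look like this, with COUNT(*) grouped by a dimension attribute standing in for an explicit measure column:

select d.product_category,
       count(*) as event_count
from facttable f
join productdim d on d.product_key = f.product_key
group by d.product_category;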

Data warehouse multivalued attributes

Disclaimer: I have never created a data warehouse before. I have read several chapters of Kimball's Data Warehouse Toolkit.
Background: Plant (factory) management team needs to be able to slice and dice production information in various ways, and we want a consistent reporting format across manufacturing plants in our division. Through business analysis, we have concluded that the fact grain is 1 row per process completed. A completed process can either mean "machine" or "assemble." I am calling this the "Production fact".
The questions that the business needs to answer are the following:
Who was working when the process completed?
What was the cycle time of the process?
What is the serial number of the part that was being produced by the process?
My schema includes the following first-level dimensions. I do not have any dimensions beyond the first level, but there are some cross relations between the plant dimension and the part type, shift, and process dimensions.
Part Type (Attributes: Surrogate Key, Part Number, Model, Variant, Part Name)
Plant (Attributes: Surrogate Key, Plant Name, Plant Acronym)
Shift (Attributes: Surrogate Key, Plant Key, Start Hour24, Start Minute, End Hour24, End Minute)
Process (Attributes: Surrogate Key, Plant Key, Production line, Process Group, Process Name, Machine Type)
Date (typical date dimension attributes)
Time of Day (typical time of day dimension attributes)
The non-dimensional facts are:
Part serial Number (instances of a part type)
Cycle time
Employee ID(s) *MULTI-VALUED*
Problem
My problem is that more than one employee may have been working the process at the time. So, I am wondering if I need to change my model and how to best represent the employee in the model. We are not trying to house employee information, just their company employee ID. I've considered the following options:
Allow for multiple employee IDs in the employee column of the fact table (e.g. comma separated). Disadvantage: the number of employees working on the process is a variable number. Would I need to create the field big enough to accommodate up to X number of employees? What should X be?
Create a record for each production fact per employee. This would mean more than one record for the same fact; that would be bad. :)
Create an employee dimension and a "Process Employees" bridge table between the employee dimension table and the fact table (see the sketch after this list). Problem: the employees working on the process at the time are not represented in the fact table.
Create an Employee dimension, a Process Employees Group table, and a bridge table between the Process Employees Group table and the Employee dimension table. The employee group and bridge tables would need to be a) pre-populated with all possible employee combinations (not practical on any level, since we have thousands of employees) or b) populated on the fly during ETL. Option 4b would require a check to see whether a given group of employees already exists for each process; this might be taxing on the DBMS/ETL system if the source records are batched more frequently than a few times per day (e.g. 10 times per hour for near real-time reporting).
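For reference, a rough sketch of what option 3 could look like (all names and types here are assumptions, not part of the original design):

-- Employee dimension holding only the company employee ID.
CREATE TABLE DimEmployee (
    EmployeeKey INT         NOT NULL PRIMARY KEY,
    EmployeeId  VARCHAR(20) NOT NULL
);

-- Production fact at the stated grain: one row per completed process.
CREATE TABLE FactProduction (
    ProductionKey    INT NOT NULL PRIMARY KEY,
    DateKey          INT NOT NULL,            -- FK to the Date dimension
    TimeOfDayKey     INT NOT NULL,            -- FK to the Time of Day dimension
    PartTypeKey      INT NOT NULL,
    PlantKey         INT NOT NULL,
    ShiftKey         INT NOT NULL,
    ProcessKey       INT NOT NULL,
    PartSerialNumber VARCHAR(50),             -- degenerate dimension
    CycleTimeSeconds INT                      -- measure
);

-- Bridge table linking each fact row to the employee(s) who worked the process.
CREATE TABLE BridgeProcessEmployees (
    ProductionKey INT NOT NULL REFERENCES FactProduction (ProductionKey),
    EmployeeKey   INT NOT NULL REFERENCES DimEmployee (EmployeeKey),
    PRIMARY KEY (ProductionKey, EmployeeKey)
);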
My Question(s)
I'm thinking that option 3 is the most viable option, but I have some reservations. Are there potential watch-outs? Are there other alternatives that I should consider? Is it okay to take the employees who worked on the process out of the fact table?
Thank you for any advice.
There is a concept called slowly changing dimensions.
These are treated as dimensions; here I will call the table PartEmployee.
The structure of this table will be:
PartId - PK
EmployeeId - PK
EmployeeStartDate - PK
EmployeeEndDate
The End Date will be null if the employee is still working on the part. When a new employee starts working on the part, the previous employee record for the part will be closed and a new record created for the part with the new employee.
Add an employee column to the PartFact table:
EmployeeId
This column will hold the current employee; the fact record will be updated every time a new employee starts working on the part.
This will give you the historical perspective of which employees worked on the part, as well as the employee who worked on the part last.
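As a hypothetical sketch of that approach (names and types assumed), the history lives in PartEmployee while the fact keeps only the current employee:

CREATE TABLE PartEmployee (
    PartId            INT  NOT NULL,
    EmployeeId        INT  NOT NULL,
    EmployeeStartDate DATE NOT NULL,
    EmployeeEndDate   DATE NULL,       -- NULL while the employee is still working on the part
    PRIMARY KEY (PartId, EmployeeId, EmployeeStartDate)
);

-- On the fact, a single column holds the current employee and is updated on each handover:
-- ALTER TABLE PartFact ADD EmployeeId INT NULL;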
Hope this helps...
I've had time to think about my options, and none of the 4 options listed in my original post are correct. The problem discussed seems to be a classic "coverage" problem; the business needs to know which employees were working which processes at a given time. If we have that information, we will know who was working on a particular part when a given process completed. This would best be represented as a factless fact table between an employee dimension and the production process dimension.
This approach also helps me save space and improve querying power, because a single employee "coverage" fact will span multiple production facts.
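A minimal sketch of such a coverage table, assuming hypothetical key names: one row per employee per process per shift, with no measures at all.

CREATE TABLE FactProcessCoverage (
    DateKey     INT NOT NULL,   -- FK to the Date dimension
    ShiftKey    INT NOT NULL,   -- FK to the Shift dimension
    ProcessKey  INT NOT NULL,   -- FK to the Process dimension
    EmployeeKey INT NOT NULL,   -- FK to the Employee dimension
    PRIMARY KEY (DateKey, ShiftKey, ProcessKey, EmployeeKey)
);

Joining this coverage table to the production fact on the shared date, shift, and process keys answers "who was working when the process completed" without duplicating production fact rows per employee.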

Entity Framework and foreign key relationships producing slow sql performance

I have normalized a Country/region/city database into multiple tables. City has a foreign key to region which has a foreign key to country.
The CITY table includes 2 additional columns for finding the associated numerical IPAddress. As you can imagine the city table has over 4 million records (representing the cities in the world which maps back to a region and then a country).
CITY, REGION, COUNTRY are entities that I have mapped with Entity Framework power tools, that all have a name column (that represents a cityname, regionname, countryname, respectively), and a primary key IDENTITY column that is indexed.
Let's say I have a table / entity called VisitorHit that has the following columns:
id as int (primary key, identity)
dateVisited as datetime
FK_City as int (which has a many to one relationship to the CITY entity)
In code I use the VisitorHit entity like:
var specialVisitors = VisitorRepository.GetAllSpecialVisitors();
var distinctCountries = specialVisitors.Select(i => i.City.CityName).Distinct().ToArray();
Now, GetAllSpecialVisitors returns a subset of the actual visitors (and it works pretty fast). The typical subset contains approximately 10,000 rows. The Select/Distinct statement takes minutes to return. Ultimately I need to further delimit the distinctCountries by a date range (using the visitorhit.datevisited field) and return the count for each distinctCountry.
Any ideas on how I could speed up this operation?
Have you looked at SQL Profiler to see what SQL is being generated for this? My first guess (since you don't post the code for GetAllSpecialVisitors) would be that you are lazy loading the City rows, in which case you are producing multiple calls to the database (one for each instance in specialVisitors) to get the city. You can eager load the city in the call to GetAllSpecialVisitors().
Use .Include("City") or .Include(v => v.City)
e.g. something like this:
var result = context.VisitorHits
    .Include(h => h.City)                // eager load the related City rows in the same query
    .Where(h => /* predicates */ true)
    .ToList();
Like I said, you need to look at what SQL Profiler is showing you to see what SQL is actually being sent to the SQL Server. But when I have issues like this, lazy loading turns out to be the most common cause.
If you try writing the query yourself in SSMS and it works well, then another solution may be to write a view and query the view. That is something else I've done on occasion when Entity Framework produces unwieldy queries that don't run efficiently.
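For example (table and column names below are assumptions based on the entities described in the question), such a view could pre-join VisitorHit to CITY so the distinct-city count over a date range runs against a single object:

-- Pre-joined view, so City is never lazy loaded per row.
CREATE VIEW vw_VisitorHitCity AS
SELECT vh.id,
       vh.dateVisited,
       c.CityName
FROM   VisitorHit AS vh
JOIN   CITY       AS c ON c.Id = vh.FK_City;

-- Example usage: visits per city within a date range.
SELECT CityName, COUNT(*) AS hits
FROM   vw_VisitorHitCity
WHERE  dateVisited BETWEEN @StartDate AND @EndDate
GROUP BY CityName;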
