I have a large XLS file produced from an SQL database for offline review and edit. I'd like to identify/flag just the modified rows so I can construct the SQL commands for the necessary row-level updates back in the database.
I can currently update ALL the rows in a set (changed or not), but performance across our VPN is pretty poor, and sometimes it's just not feasible. It would be very helpful to update just the flagged rows.
Any suggestions to begin to address this?
Notes:
The XLS row count can be 5K to 100K (or higher) rows depending on the export query. There are 25 columns per row; format is fixed.
No issue with VBA/macros.
A poorman's approach is fine - this is an internal project.
Thanks!
I found a very workable approach.
Simply drop this code into the worksheet's code module (Worksheet_Change is a sheet event, so it won't fire from a standard module), adjust the flag/column range, and presto.
This works with cutting/pasting across many rows:
Private Sub Worksheet_Change(ByVal Target As Range)
    Dim c As Range
    ' Suspend events so writing the timestamp doesn't re-trigger this handler
    Application.EnableEvents = False
    For Each c In Target
        ' Only flag edits made in the data columns (B through Q)
        If c.Column > 1 And c.Column < 18 Then
            ' Stamp column A of the edited row with the modification time
            Cells(c.Row, 1) = Now
        End If
    Next c
    Application.EnableEvents = True
End Sub
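Once the timestamp lands in column A, you can filter on non-blank timestamps and generate one UPDATE per flagged row. A minimal sketch of the kind of SQL you'd generate, assuming a hypothetical target table my_table with primary key id (your real table, key, and the 25 columns go here):

UPDATE my_table
SET col_b = 'edited value', col_c = 'edited value'  -- placeholder columns from the edited row
WHERE id = 12345;  -- key value taken from the flagged row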
To explain my problem I use this example data set:
SampleID   Date         Project   Problem
03D00173   03-Dec-2010            1,00
03D00173   03-Dec-2010            1,00
03D00173   28-Sep-2009  YNTRAD
03D00173   28-Sep-2009  YNTRAD
Now, the problem is that I need to replace the text "YNTRAD" with "YNTRAD_PILOT" but only for the cases with Date = 28-Sep-2009.
This example is part of a much larger database, with many more cases having Project=YNTRAD and Date=28-Sep-2009, so I cannot simply select all cases with 28-Sep-2009 first, then check which of those have Project=YNTRAD, and then replace. Instead, what I need to do is:
1. Look at each case that has a 1,00 in Problem (these are the problem cases).
2. Find the SampleID that corresponds with that sample.
3. Find all other cases with the same SampleID BUT WITH Date=28-Sep-2009 (this is needed because only those samples are part of a pilot study) and then replace YNTRAD in Project with YNTRAD_PILOT.
I read a lot about:
- LOOP
- DO REPEAT
- DO IF
but I don't know how to use these in solving this problem.
I first tried making a list containing only the SampleIDs that eventually need to be changed (again, this is part of a much larger database).
STRING SampleID2 (A20).
IF (Problem=1) SampleID2=SampleID.
EXECUTE.
AGGREGATE
/OUTFILE=*
/BREAK=SampleID2
/n_SampleID2=N.
This gives a dataset with only the SampleIDs for which a change should be made. However, I don't know how to read this dataset case by case, look up each SampleID in the overall file with all the data, and then change only those cases where Date = 28-Sep-2009.
It sounds like once we can identify the IDs that need to be changed we've done the tricky part here. We can use AGGREGATE with MODE=ADDVARIABLES to add a problem Id counter variable to our dataset. From there, it's as you'd expect.
* Add var IdProblemCnt to your database . Stores # of times a given Id had a record with Problem = 1.
AGGREGATE
/OUTFILE=* MODE=ADDVARIABLES
/BREAK=SampleId
/IdProblemCnt=CIN(Problem, 1, 1) .
EXE .
* Once we've identified the "problem" Ids we can use RECODE on the Project var .
DO IF (IdProblemCnt > 0 AND Date = DATE.MDY(9,28,2009)) .
RECODE Project ('YNTRAD' = 'YNTRAD_PILOT') .
END IF .
EXE .
I'm pretty new to data warehouse design and am struggling with how to design the fact table given very similar, but somewhat different, metrics. Let's say you were evaluating the metrics below (in this case, company is a subset of client): how would you break up the fact tables? Could you use one table for all of this, would each metric being measured warrant its own fact table, or would each part of the metric being measured be its own column in one fact table?
Total company daily/monthly/yearly # of files processed
Total company daily/monthly/yearly file sizes processed
Total company daily/monthly/yearly # files errored
Total company daily/monthly/yearly # files failed
Total client daily/monthly/yearly # of files processed
Total client daily/monthly/yearly file sizes processed
Total client daily/monthly/yearly # files errored
Total client daily/monthly/yearly # files failed
By the looks of the measure names, I think you'll be well served with a single fact table with a record for each file and a link back to a date_dim:
create table date_dim (
  date_sk int,
  calendar_date date,
  month_ordinal int,
  month_name nvarchar(20),
  year int
  -- etc.; you've got your own one of these
)
create table fact_file_measures (
  date_sk int,      -- ref date_dim
  file_sk int,      -- ref the file dim with additional file info
  company_sk int,   -- ref the company dim with the client details
  processed int,    -- should always be one; optional, depending on how your reporting team likes to work
  size_Kb decimal,  -- specify a size measurement, ambiguity is bad
  error_count int,  -- 1 if file had an error, 0 if fine
  failed_count int  -- 1 if file failed, 0 if fine
)
so now you should be able to construct queries to get everything you asked for
for example, for your monthly stats:
select
  c.company_name,
  c.client_name,
  sum(f.processed) total_files,
  sum(f.size_Kb) total_files_size_Kb,
  sum(f.error_count) total_files_errored,
  sum(f.failed_count) total_files_failed
from
  fact_file_measures f
  inner join dim_company c on f.company_sk = c.company_sk
  inner join date_dim d on f.date_sk = d.date_sk
where
  d.month_name = 'January' and d.year = 1984
group by
  c.company_name,
  c.client_name
If you need the side-by-side day/month/year stuff, you can construct year and month fact tables to do the roll-ups and join back via date_dim's month/year fields. (You could include month and year fields in the daily fact table, but those values may end up being misused by less experienced report builders.) It all comes back to what your users actually want - design your fact tables to their requirements and don't be afraid to have separate fact tables. Data warehousing is not about normalization; it's about presenting the data in a way that it can be used.
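If you do build a monthly roll-up, a minimal sketch under the schema above (fact_file_measures_monthly is a made-up name, and create table ... as select syntax varies by engine - SQL Server uses SELECT ... INTO):

create table fact_file_measures_monthly as
select
  d.year,
  d.month_ordinal,
  f.company_sk,
  sum(f.processed) files_processed,
  sum(f.size_Kb) size_Kb,
  sum(f.error_count) files_errored,
  sum(f.failed_count) files_failed
from
  fact_file_measures f
  inner join date_dim d on f.date_sk = d.date_sk
group by
  d.year, d.month_ordinal, f.company_sk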
Good luck
I have a project that looks up distinct orders from the database. It then creates a list of CustomerNumbers from a field in the returned orders, and filters the customers based on those CustomerNumbers. The issue is it only ever returns 2 customers when there should be 10. I also returned a count of all customers, and that too only returns 2, even though there are a total of 667K+ customers in the database. I tried uninstalling the EF nuget package and reinstalling. I checked to make sure the repositories I have set up aren't filtering anything in any way. I'm stuck and under the gun right now. Any help would be great. Also, any refactoring suggestions or EF changes are welcome too. Thanks!
var count = dbCustomer.Records.Count();
var orders = dbOrders.Records.ToList();
data.Orders = orders;
var orderCustomerNumbers = data.Orders.Select(o => o.Meta15).Distinct().ToList();
var orderNumbers = data.Orders.Select(o => o.OrderNumber.ToLower()).ToList();
data.Customers = dbCustomer.Records.Where(c => orderCustomerNumbers.Contains(c.CustomerNumber)).ToList();
data.Payments = dbPayment.Records.Where(p => orderNumbers.Contains(p.OrderNumber.ToLower())).ToList();
data.Products = dbProducts.Records.ToList();
Sorry for not getting more details out, but I did fix the issue. Originally I had issues getting the data into the database using the app (EF). To get around that and keep moving forward, I used SSMS to import the data. Since the import didn't go through EF, the Discriminator column was never filled in, so the only two rows being returned were the rows with the correct value in the Discriminator column. After running a quick SQL statement to update all the rows with the correct Discriminator value, everything is working fine now.
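For anyone who hits the same thing, the fix was along these lines (Customers and the value 'Customer' are placeholders - by default EF stores the entity class name in the Discriminator column for TPH inheritance):

UPDATE Customers
SET Discriminator = 'Customer'  -- the entity type name EF expects
WHERE Discriminator IS NULL OR Discriminator = '';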
In my application a User has Highlights.
Each Highlight has a HighlightType. So if I run user.highlights I might see an output like this:
Notice that there are many highlights of type_id 47. This marks milestones of the number of times the user has gone running.
What I would like to do is return this full list of records, but only include one highlight for each highlight_type, and I want that one record to be the most recent record (in this case the "50th run" highlight). So in the example above I would get the same results but with IDs 195-199 removed.
Is there an efficient way to accomplish this?
I don't think there is an easy or clean way to achieve that, nor a "Rails way". Look at e.g. this link
According to one suggestion in that link you would do this SQL request:
SELECT h1.*
FROM highlights h1
LEFT JOIN highlights h2
ON (h1.user_id = h2.user_id
AND h1.highlight_type_id = h2.highlight_type_id
AND h1.created_at < h2.created_at)
WHERE h2.id IS NULL AND h1.user_id = <the user id you are interested in>
GROUP BY h1.highlight_type_id
I think it could be a performance problem if you have big tables, and it's not so very clean, I think.
Otherwise, if there aren't too many highlights per user, I would have done something like this:
rows = {}
# ||= keeps only the first row seen per highlight_type_id,
# which is the newest one given the ordering below
user.highlights.order('highlight_type_id, created_at DESC').each do |hi|
  rows[hi.highlight_type_id] ||= hi
end
# then use rows, which will have one object for each highlight_type_id
The DESC on created_at is important
EDIT:
I also saw some suggestions based on this
user.highlights.group('highlight_type_id').order('created_at DESC')
And that was also how I first thought it should be solved, but I tested it and it doesn't seem to give a correct result - at least on my test data.
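If you want a plain-SQL alternative that should return the correct rows, a correlated subquery that keeps only the newest highlight per type is an option (untested against your schema, so treat it as a sketch):

SELECT h1.*
FROM highlights h1
WHERE h1.user_id = <the user id you are interested in>
  AND h1.created_at = (SELECT MAX(h2.created_at)
                       FROM highlights h2
                       WHERE h2.user_id = h1.user_id
                         AND h2.highlight_type_id = h1.highlight_type_id)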
I have a table with almost 300K records in it. I run a simple select statement with a where clause on an indexed column ('type' is indexed):
SELECT *
FROM Asset_Spec
WHERE type = 'County'
That query is fast - about 1 second. Additionally I want to test against status:
SELECT *
FROM Asset_Spec
WHERE type = 'County'
AND status = 'Active'
The second one is VERY slow (minutes). Status is NOT indexed and in this particular case 99.9% of values in the db ARE 'Active'.
Any ideas how I can get better performance? We are compiling our own version of SQLite so I can tweak many settings (FYI - same performance on iOS pre-canned SQLite)
I looked at the query plan and the row estimate was off by an astounding amount: Asset_Spec was estimated at ~2 rows, while the actual number is almost 300,000. Ran 'ANALYZE' - now the same query runs in 16ms.
The first thing I would try is using a subquery:
SELECT * FROM
(SELECT *
FROM Asset_Spec
WHERE type = 'County')
WHERE status = 'Active'
and as Robert suggests, adding an index on any column you want to filter by is a good idea. I'd also consider changing fields Type and Status to be something other than string.
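For example, a composite index covering both filters, plus fresh statistics (this mirrors the ANALYZE fix in the edit above; the index name is my own):

CREATE INDEX idx_asset_spec_type_status ON Asset_Spec (type, status);
ANALYZE;  -- refresh the planner's row estimates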
Any reason you need to select *?
Suggestions:
Do you need to retrieve multiple records? If all you need is the first record found, then add "limit 1" to the end of the query.
If you're just checking for the existence of a row, i.e. you only need to know that there is one row with status active, then "select 1" instead of "select *".
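For instance, the existence check from the second suggestion would look like:

SELECT 1
FROM Asset_Spec
WHERE type = 'County' AND status = 'Active'
LIMIT 1;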