Generate a count variable (by group) if a condition is met in SAS

For each ID, count the number of times "FAIL" appears in the QC column, and display that number in an output column (to be generated by the code).
[screenshot of the desired output]
proc sort data=dataset;
  by ID;
run;

data dataset;
  set dataset;
  by ID;
  retain count 0;
  if first.qc then count=count+1;
run;

You need to make two passes through the data to generate the output in your screenshot: one pass to count, and a second to attach the count back onto the individual observations in the group.
You could use PROC SQL, because it will automatically re-merge the aggregate statistic back onto the detail rows for you.
proc sql;
  create table want as
    select *, sum(QC='FAIL') as COUNT
    from have
    group by id
  ;
quit;
You could do it with a data step by reading the input twice. Once to do the count and then again to re-read the observations and write them out.
data want;
  /* First pass: accumulate COUNT across the BY group */
  do until(last.id);
    set have;
    by id;
    count=sum(count, qc='FAIL');
  end;
  /* Second pass: re-read the same BY group and output each observation with the final count */
  do until(last.id);
    set have;
    by id;
    output;
  end;
run;

Related

Return 2 resultsets from a cursor based on one query (nested cursor)

I'm trying to obtain two different resultsets from a stored procedure, based on a single query. What I'm trying to do is:
1. return the query result into an OUT cursor;
2. from that cursor's results, get the longest value in each column and return those as a second OUT resultset.
I'm trying to avoid doing the same thing twice: getting the data, and after that getting the longest column values of that same data. I'm not sure if this is even possible, but if it is, can somebody show me how?
This is an example of what I want to do (just for illustration):
CREATE OR REPLACE PROCEDURE MySchema.Test(RESULT OUT SYS_REFCURSOR,MAX_RESULT OUT SYS_REFCURSOR)
AS
BEGIN
OPEN RESULT FOR SELECT Name,Surname FROM MyTable;
OPEN MAX_RESULT FOR SELECT Max(length(Name)),Max(length(Surname)) FROM RESULT; --error here
END Test;
This example fails to compile, with "ORA-00942: table or view does not exist".
I know it's a silly example, but I've been investigating and testing all sorts of things (implicit cursors, fetching cursors, nested cursors, etc.) and found nothing that would help me, especially when working with a stored procedure returning multiple resultsets.
My overall goal with this is to shorten data export time for Excel. Currently I have to run the same query twice: once to calculate the data size to autofit the Excel columns, and then again to write the data into Excel.
I believe that manipulating first resultset in order to get second one would be much faster - with less DB cycles made.
I'm using Oracle 11g. Any help much appreciated.
Each row of data from a cursor can be read exactly once; once the next row (or set of rows) has been read, the previous one cannot be returned to, and the cursor cannot be re-used. So what you are asking is impossible. Setting aside the fact that you can't use a cursor as a source in a SELECT statement (you could instead read it with a PL/SQL loop), if you read the cursor to find the maximum values, its rows would already have been "used up", and the exhausted cursor would be useless to the caller when returned from the procedure.
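To see the read-once behaviour concretely, here is a minimal sketch (not from the original question; it assumes VARCHAR2(50) columns) that computes the maxima by fetching the cursor in a PL/SQL loop. After the loop the cursor is spent, so there would be nothing left for the caller:
DECLARE
  rc      SYS_REFCURSOR;
  v_name  VARCHAR2(50);
  v_sur   VARCHAR2(50);
  v_max_n PLS_INTEGER := 0;
  v_max_s PLS_INTEGER := 0;
BEGIN
  OPEN rc FOR SELECT Name, Surname FROM MyTable;
  LOOP
    FETCH rc INTO v_name, v_sur;
    EXIT WHEN rc%NOTFOUND;
    -- accumulate the per-column maximum lengths
    v_max_n := GREATEST(v_max_n, NVL(LENGTH(v_name), 0));
    v_max_s := GREATEST(v_max_s, NVL(LENGTH(v_sur), 0));
  END LOOP;
  CLOSE rc;  -- rc is exhausted; had it been an OUT parameter, the caller would fetch nothing
END;
/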
You would need to use two separate queries:
CREATE PROCEDURE MySchema.Test(
RESULT OUT SYS_REFCURSOR,
MAX_RESULT OUT SYS_REFCURSOR
)
AS
BEGIN
OPEN RESULT FOR
SELECT Name,
Surname
FROM MyTable;
OPEN MAX_RESULT FOR
SELECT MAX(LENGTH(Name)) AS max_name_length,
MAX(LENGTH(Surname)) AS max_surname_length
FROM MyTable;
END Test;
/
Just for theoretical purposes, it is possible to read from the table only once, if you bulk collect the data into a collection and then select from a table-collection expression. However, it is more complicated to code and maintain, it requires the rows from the table to be held in memory (which your DBA might not appreciate if the table is large), and it may not perform any better than querying the table twice, as you end up with three SELECT statements instead of two.
Something like:
CREATE TYPE test_obj IS OBJECT(
name VARCHAR2(50),
surname VARCHAR2(50)
);
CREATE TYPE test_obj_table IS TABLE OF test_obj;
CREATE PROCEDURE MySchema.Test(
RESULT OUT SYS_REFCURSOR,
MAX_RESULT OUT SYS_REFCURSOR
)
AS
t_names test_obj_table;
BEGIN
SELECT test_obj( Name, Surname )
BULK COLLECT INTO t_names
FROM MyTable;
OPEN RESULT FOR
SELECT * FROM TABLE( t_names );
OPEN MAX_RESULT FOR
SELECT MAX(LENGTH(Name)) AS max_name_length,
MAX(LENGTH(Surname)) AS max_surname_length
FROM TABLE( t_names );
END Test;
/

How to check if join was 1-1 or 1-many in SAS?

Is there an efficient way in SAS to verify if a join you ran was a 1 to 1 or a 1 to many join? I often work with tables that do not have a clear unique identifier which has led me to running 1-many joins thinking they were 1-1, thus messing up my analysis.
In the simple case where I'm expecting the input datasets for a merge to be unique by some key, I will often code a simple assertion into the merge that throws an error if any duplicates are found:
Sample data:
data one;
  do id=1,2,3;
    output;
  end;
run;

data two;
  do id=1,2,2,3,4,4;
    output;
  end;
run;
Log:
16 data want;
17 merge one two;
18 by id;
19 if not (first.id and last.id) then put "ERROR: duplicates!" id=;
20 run;
ERROR: duplicates!id=2
ERROR: duplicates!id=2
ERROR: duplicates!id=4
ERROR: duplicates!id=4
NOTE: There were 3 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables.
That doesn't tell you which dataset has the duplicates (for that you need IN= variables, as in Tom's answer), but it's an easy safety net to catch duplicates.
You can also just check your output dataset for duplicates after the merge, e.g.
data _null_;
  set want (keep=id);
  by id;
  if not (first.id and last.id) then put "ERROR: Duplicate ! " id=;
run;
Duplicates are dangerous.
You can use the IN= flags, but you need to clear them.
Let's make some sample datasets.
data one;
  do id=1,2,2,3;
    output;
  end;
run;

data two;
  do id=1,1,2,2,3,3;
    output;
  end;
run;
Now merge them by ID. Clear the IN= variables before the MERGE statement so that a dataset's flag is not carried forward on iterations where it has stopped contributing observations to the BY group.
data want;
  call missing(in1, in2);
  merge one(in=in1) two(in=in2);
  by id;
  if not first.id and sum(of in1-in2) > 1 then put 'Multiple Merge: ' (_n_ id in1 in2) (=);
run;
Results in the LOG.
Multiple Merge: _N_=4 id=2 in1=1 in2=1
NOTE: MERGE statement has more than one data set with repeats of BY values.
NOTE: There were 4 observations read from the data set WORK.ONE.
NOTE: There were 6 observations read from the data set WORK.TWO.
NOTE: The data set WORK.WANT has 6 observations and 1 variables.
Checking before merging is a better idea. Here are two nice and easy ways to do it (supposing we have a dataset named one with column id to be used for the merge).
Identify duplicate id's with PROC FREQ
proc freq data=one noprint;
  table id / out=freqs_id_one(where=(count > 1));
run;
Sort dataset using NODUPKEY
...redirecting duplicate ids into a separate dataset:
proc sort data=one nodupkey out=one_nodupids dupout=one_dupids;
  by id;
run;
Checking after-the-fact
If you realize too late that you didn't check for dupes (doh!), you can obtain the frequencies of the id with PROC FREQ (same code as above) or with a PROC SQL query:
proc sql;
  select id,
         count(id) as count
  from merged_dataset
  group by id
  having count(id) > 1;
quit;

PostgreSQL gapless sequences

I'm moving from MySQL to Postgres, and I noticed that when you delete rows in MySQL, the unique ids for those rows are re-used when you create new ones. With Postgres, if you create rows and delete them, the unique ids are not used again.
Is there a reason for this behaviour in Postgres? Can I make it act more like MySQL in this case?
Sequences have gaps to permit concurrent inserts. Attempting to avoid gaps or to re-use deleted IDs creates horrible performance problems. See the PostgreSQL wiki FAQ.
PostgreSQL SEQUENCEs are used to allocate IDs. These only ever increase, and they're exempt from the usual transaction rollback rules so that multiple transactions can grab new IDs at the same time. This means that if a transaction rolls back, those IDs are "thrown away"; there's no list of "free" IDs kept, just the current ID counter. Sequences can also skip values if the database shuts down uncleanly, because cached sequence values are discarded.
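To illustrate the rollback exemption, a small sketch (assuming a freshly created sequence; the commented values are what it would return):
CREATE SEQUENCE s;
BEGIN;
SELECT nextval('s');  -- returns 1
ROLLBACK;
SELECT nextval('s');  -- returns 2: the value handed to the rolled-back transaction is not reissued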
Synthetic keys (IDs) are meaningless anyway. Their order is not significant, their only property of significance is uniqueness. You can't meaningfully measure how "far apart" two IDs are, nor can you meaningfully say if one is greater or less than another. All you can do is say "equal" or "not equal". Anything else is unsafe. You shouldn't care about gaps.
If you need a gapless sequence that re-uses deleted IDs, you can have one; you just have to give up a huge amount of performance for it. In particular, you cannot have any concurrency on INSERTs at all, because you have to scan the table for the lowest free ID while locking the table against writes, so that no other transaction can claim the same ID. Try searching for "postgresql gapless sequence".
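Purely to illustrate what that costs, here is a sketch of such a scan (my own example, assuming a table mytable(id integer primary key)):
BEGIN;
-- block all concurrent writers until COMMIT
LOCK TABLE mytable IN EXCLUSIVE MODE;
-- smallest positive integer not already used as an id
SELECT MIN(candidate) AS next_id
FROM (
  SELECT 1 AS candidate
  UNION ALL
  SELECT id + 1 FROM mytable
) AS c
WHERE NOT EXISTS (SELECT 1 FROM mytable WHERE mytable.id = c.candidate);
-- INSERT the new row using next_id, then:
COMMIT;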
The simplest approach is to use a counter table and a function that gets the next ID. Here's a generalized version that uses a counter table to generate consecutive gapless IDs; it doesn't re-use IDs, though.
CREATE TABLE thetable_id_counter ( last_id integer not null );
INSERT INTO thetable_id_counter VALUES (0);
CREATE OR REPLACE FUNCTION get_next_id(countertable regclass, countercolumn text) RETURNS integer AS $$
DECLARE
next_value integer;
BEGIN
EXECUTE format('UPDATE %s SET %I = %I + 1 RETURNING %I', countertable, countercolumn, countercolumn, countercolumn) INTO next_value;
RETURN next_value;
END;
$$ LANGUAGE plpgsql;
COMMENT ON FUNCTION get_next_id(regclass, text) IS 'Increment and return value from integer column $2 in table $1';
Usage:
INSERT INTO dummy(id, blah)
VALUES ( get_next_id('thetable_id_counter','last_id'), 42 );
Note that when one open transaction has obtained an ID, all other transactions that try to call get_next_id will block until the first transaction commits or rolls back. This is unavoidable: for gapless IDs, it is by design.
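For example, with the function above (the table dummy is from the usage example):
-- Session 1
BEGIN;
INSERT INTO dummy(id, blah)
VALUES ( get_next_id('thetable_id_counter','last_id'), 42 );  -- locks the counter row
-- Session 2
BEGIN;
INSERT INTO dummy(id, blah)
VALUES ( get_next_id('thetable_id_counter','last_id'), 43 );  -- blocks here ...
-- ... until Session 1 COMMITs or ROLLs BACK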
If you want to store multiple counters for different purposes in a table, add a parameter to the above function, add a column to the counter table, and add a WHERE clause to the UPDATE that matches the parameter to the added column. That way you can have multiple independently-locked counter rows. Do not just add extra columns for new counters; every caller would then contend for the single row's lock.
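A sketch of that variant (hypothetical names, and written against a fixed table rather than with EXECUTE format, for brevity):
CREATE TABLE id_counters (
    counter_name text PRIMARY KEY,
    last_id      integer NOT NULL
);
INSERT INTO id_counters VALUES ('invoices', 0), ('receipts', 0);

CREATE OR REPLACE FUNCTION get_next_id_for(p_counter text) RETURNS integer AS $$
DECLARE
    next_value integer;
BEGIN
    -- the WHERE clause means each counter row is locked independently
    UPDATE id_counters
       SET last_id = last_id + 1
     WHERE counter_name = p_counter
    RETURNING last_id INTO next_value;
    RETURN next_value;
END;
$$ LANGUAGE plpgsql;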
This function does not re-use deleted IDs, it just avoids introducing gaps.
To re-use IDs I advise ... not re-using IDs.
If you really must, you can do so by adding a trigger ON INSERT OR UPDATE OR DELETE on the table of interest that adds deleted IDs to a free-list side table and removes them from it when they're INSERTed (treat an UPDATE as a DELETE followed by an INSERT). Then modify the ID generation function above so that it first tries SELECT free_id FROM free_ids LIMIT 1 FOR UPDATE and, if a row is found, DELETEs it and returns its ID; if no row is found, it gets a new ID from the generator table as normal. Here's an untested extension of the prior function to support re-use:
CREATE OR REPLACE FUNCTION get_next_id_reuse(countertable regclass, countercolumn text, freelisttable regclass, freelistcolumn text) RETURNS integer AS $$
DECLARE
next_value integer;
BEGIN
EXECUTE format('SELECT %I FROM %s LIMIT 1 FOR UPDATE', freelistcolumn, freelisttable) INTO next_value;
IF next_value IS NOT NULL THEN
EXECUTE format('DELETE FROM %s WHERE %I = %L', freelisttable, freelistcolumn, next_value);
ELSE
EXECUTE format('UPDATE %s SET %I = %I + 1 RETURNING %I', countertable, countercolumn, countercolumn, countercolumn) INTO next_value;
END IF;
RETURN next_value;
END;
$$ LANGUAGE plpgsql;

Self reference update on insert trigger in Informix

I'm extracting data from various sources into one table. In this new table, there's a field called lineno. This field's value should be sequential, based on company code and batch number. I've written the following procedure:
CREATE PROCEDURE update_line(company CHAR(4), batch CHAR(8), rcptid CHAR(12));
DEFINE lineno INT;
SELECT Count(*)
INTO lineno
FROM tmp_cb_rcpthdr
WHERE cbrh_company = company
AND cbrh_batchid = batch;
UPDATE tmp_cb_rcpthdr
SET cbrh_lineno = lineno + 1
WHERE cbrh_company = company
AND cbrh_batchid = batch
AND cbrh_rcptid = rcptid;
END PROCEDURE;
This procedure will be called using the following trigger
CREATE TRIGGER tmp_cb_rcpthdr_ins INSERT ON tmp_cb_rcpthdr
  REFERENCING NEW AS n
  FOR EACH ROW
  (
    EXECUTE PROCEDURE update_line(n.cbrh_company, n.cbrh_batchid, n.cbrh_rcptid)
  );
However, I got the following error
SQL Error = -747 Table or column matches object referenced in
triggering statement.
From oninit.com, I learned that the error is caused by a triggered SQL statement acting on the triggering table, which in this case is the UPDATE statement.
So my question is, how do I solve this problem? Is there any work around or better solution?
I think the design needs to be reconsidered. For a start, what happens if some rows get deleted from tmp_cb_rcpthdr? The COUNT(*) query will then produce duplicate lineno values.
Even if this is an ETL only process, and you can be confident the data won't be manipulated from elsewhere, performance will be an issue, and will only get worse the more data you have for any one combination of company and batch_id.
Is it necessary for the lineno to increment from zero, or is it just there to maintain the original load order? If it's the latter, a SEQUENCE or a SERIAL field on the table will achieve the same end and be a lot more efficient, as sketched below.
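For instance (a sketch; the sequence name and sample values are made up):
CREATE SEQUENCE cb_rcpthdr_seq START WITH 1 INCREMENT BY 1;

INSERT INTO tmp_cb_rcpthdr (cbrh_company, cbrh_batchid, cbrh_rcptid, cbrh_lineno)
VALUES ('C001', 'B0000001', 'R00000000001', cb_rcpthdr_seq.NEXTVAL);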
If you must generate lineno in this way, I would suggest you create a second control table, keyed on company and batch_id, that tracks the current lineno value, ie: (untested)
CREATE PROCEDURE update_line(company CHAR(4), batch CHAR(8)) RETURNING INT;
  DEFINE lineno INT;

  SELECT cbrh_lineno INTO lineno
  FROM linenoctl
  WHERE cbrh_company = company
    AND cbrh_batchid = batch;

  UPDATE linenoctl
  SET cbrh_lineno = lineno + 1
  WHERE cbrh_company = company
    AND cbrh_batchid = batch;
  -- A test that no other process has grabbed this record
  -- might need to be considered here, ie cbrh_lineno = lineno

  RETURN lineno + 1;
END PROCEDURE;
Then use it as follows:
CREATE TRIGGER tmp_cb_rcpthdr_ins INSERT ON tmp_cb_rcpthdr
  REFERENCING NEW AS n
  FOR EACH ROW
  (
    EXECUTE PROCEDURE update_line(n.cbrh_company, n.cbrh_batchid) INTO cbrh_lineno
  );
See the IDS documentation for more on using calculated values with triggers.

MySQL stored procedure causing problems?

EDIT:
I've narrowed my MySQL wait timeout down to this statement:
IF @resultsFound > 0 THEN
  INSERT INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
END IF;
Any idea why this would cause a problem? I can't work it out!
I've written a stored proc to search for products in certain categories. Due to certain constraints I came across, I was unable to do what I wanted in plain queries (limiting, but whilst still returning the total number of rows found, with sorting, etc.).
It's meant to split up a string of category Ids, such as 1,2,3, into a temporary table, then build the full-text search query based on the sorting options and limits, execute the query string, and then select out the total number of results.
Now, I know I'm no MySQL guru, very far from it. I've got it working, but I keep getting timeouts with product searches etc., so I'm thinking this may be causing some kind of problem.
Does anyone have any ideas how I can tidy this up, or even do it in a much better way that I probably don't know about?
Thanks.
DELIMITER $$
DROP PROCEDURE IF EXISTS `product_search` $$
CREATE DEFINER=`root`@`localhost` PROCEDURE `product_search`(keywords text, categories text, topLevelCategoryId int, sortOrder int, startOffset int, itemsToReturn int)
BEGIN
  declare foundPos tinyint unsigned;
  declare tmpTxt text;
  declare delimLen tinyint unsigned;
  declare element text;
  declare resultingNum int unsigned;

  drop temporary table if exists categoryIds;
  create temporary table categoryIds
  (
    `CategoryId` int
  ) engine = memory;

  set tmpTxt = categories;
  set foundPos = instr(tmpTxt, ',');
  while foundPos <> 0 do
    set element = substring(tmpTxt, 1, foundPos-1);
    set tmpTxt = substring(tmpTxt, foundPos+1);
    set resultingNum = cast(trim(element) as unsigned);
    insert into categoryIds (`CategoryId`) values (resultingNum);
    set foundPos = instr(tmpTxt, ',');
  end while;
  if tmpTxt <> '' then
    insert into categoryIds (`CategoryId`) values (tmpTxt);
  end if;
  CASE
    WHEN sortOrder = 0 THEN
      SET @sortString = "ProductResult_Relevance DESC";
    WHEN sortOrder = 1 THEN
      SET @sortString = "ProductResult_Price ASC";
    WHEN sortOrder = 2 THEN
      SET @sortString = "ProductResult_Price DESC";
    WHEN sortOrder = 3 THEN
      SET @sortString = "ProductResult_StockStatus ASC";
  END CASE;
SET @theSelect = CONCAT(CONCAT("
SELECT SQL_CALC_FOUND_ROWS
supplier.SupplierId as Supplier_SupplierId,
supplier.Name as Supplier_Name,
supplier.ImageName as Supplier_ImageName,
product_result.ProductId as ProductResult_ProductId,
product_result.SupplierId as ProductResult_SupplierId,
product_result.Name as ProductResult_Name,
product_result.Description as ProductResult_Description,
product_result.ThumbnailUrl as ProductResult_ThumbnailUrl,
product_result.Price as ProductResult_Price,
product_result.DeliveryPrice as ProductResult_DeliveryPrice,
product_result.StockStatus as ProductResult_StockStatus,
product_result.TrackUrl as ProductResult_TrackUrl,
product_result.LastUpdated as ProductResult_LastUpdated,
MATCH(product_result.Name) AGAINST(?) AS ProductResult_Relevance
FROM
product_latest_state product_result
JOIN
supplier ON product_result.SupplierId = supplier.SupplierId
JOIN
category_product ON product_result.ProductId = category_product.ProductId
WHERE
MATCH(product_result.Name) AGAINST (?)
AND
category_product.CategoryId IN (select CategoryId from categoryIds)
ORDER BY
", @sortString), "
LIMIT ?, ?;
");
  set @keywords = keywords;
  set @startOffset = startOffset;
  set @itemsToReturn = itemsToReturn;

  PREPARE TheSelect FROM @theSelect;
  EXECUTE TheSelect USING @keywords, @keywords, @startOffset, @itemsToReturn;
  SET @resultsFound = FOUND_ROWS();
  SELECT @resultsFound as 'TotalResults';

  IF @resultsFound > 0 THEN
    INSERT INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
  END IF;
END $$
DELIMITER ;
Any help is very very much appreciated!
There is little you can do with this query.
Try this:
Create a PRIMARY KEY on categoryIds (categoryId)
Make sure that supplier (SupplierId) is a PRIMARY KEY
Make sure that category_product (ProductId, CategoryId) (in this order) is a PRIMARY KEY, or that you have an index with ProductId leading; see the sketch below.
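In DDL terms the three suggestions look roughly like this (a sketch; adjust to your actual schema):
-- 1. key the temporary table (inside the procedure):
create temporary table categoryIds
(
  `CategoryId` int,
  primary key (`CategoryId`)
) engine = memory;

-- 2. and 3. on the permanent tables:
ALTER TABLE supplier ADD PRIMARY KEY (SupplierId);
ALTER TABLE category_product ADD PRIMARY KEY (ProductId, CategoryId);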
Update:
If it's the INSERT that causes the problem and product_search_query is a MyISAM table, the issue can be MyISAM locking.
MyISAM locks the whole table if it decides to insert a row into a free block in the middle of the table, which can cause the timeouts.
Try using INSERT DELAYED instead:
IF @resultsFound > 0 THEN
  INSERT DELAYED INTO product_search_query (QueryText, CategoryId) VALUES (keywords, topLevelCategoryId);
END IF;
This will put the records into the insertion queue and return immediately. The record will be added later asynchronously.
Note that you may lose information if the server dies after the command is issued but before the records are actually inserted.
Update:
Since your table is InnoDB, it may be an issue with table locking. INSERT DELAYED is not supported on InnoDB.
Depending on the nature of the query, DML queries on an InnoDB table may place gap locks which will block the inserts.
For instance:
CREATE TABLE t_lock (id INT NOT NULL PRIMARY KEY, val INT NOT NULL) ENGINE=InnoDB;
INSERT
INTO t_lock
VALUES
(1, 1),
(2, 2);
This query performs ref scans and places the locks on individual records:
-- Session 1
START TRANSACTION;
UPDATE t_lock
SET val = 3
WHERE id IN (1, 2)
-- Session 2
START TRANSACTION;
INSERT
INTO t_lock
VALUES (3, 3)
-- Success
This query, while doing the same, performs a range scan and places a gap lock after key value 2, which will not let insert key value 3:
-- Session 1
START TRANSACTION;
UPDATE t_lock
SET val = 3
WHERE id BETWEEN 1 AND 2
-- Session 2
START TRANSACTION;
INSERT
INTO t_lock
VALUES (3, 3)
-- Locks
Try wrapping your EXECUTE with the following:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED ;
EXECUTE TheSelect USING @keywords, @keywords, @startOffset, @itemsToReturn;
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ ;
I do something similar in T-SQL for all reporting stored procs and searches where repeatable reads aren't important, to reduce locking/blocking issues with other processes running on the database.
Turn on the slow query log; that will give you an idea of what is taking so long to execute that there is a timeout.
http://dev.mysql.com/doc/refman/5.1/en/slow-query-log.html
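On MySQL 5.1 you can enable it at runtime, something like this (a sketch; the threshold is arbitrary):
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 2;                   -- log statements slower than 2 seconds
SET GLOBAL log_queries_not_using_indexes = 'ON';  -- also log queries that use no index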
Pick the slowest query and optimise that. Then run for a while and repeat.
There is some excellent information and tools here http://hackmysql.com/nontech
DC
UPDATE:
Either you have a network problem causing the timeout (unlikely if you are using a local MySQL instance), or something is locking a table for far too long, causing the timeout. The process that is locking the table (or tables) for far too long will show up in the slow query log as a slow query. You can also have the slow query log record any queries that fail to use an index, resulting in an inefficient query.
If you can get the problem to occur while you are present, you can also use a tool like phpMyAdmin or the command line to run "SHOW PROCESSLIST\G"; this will give you a list of the queries running while the problem is occurring.
You think the problem is in your INSERT statement, so something is probably locking that table. To find it, you need to find what is running so slowly that it locks the table for far too long. The slow query log is one way to do that.
Other things to look at
CPU - is it idle or running at full pelt?
IO - is IO causing holdups?
RAM - are you swapping all the time? (that will cause excessive IO)
Does the table product_search_query use an index?
What is the primary key?
If your index uses strings that are too long, you may build a huge index file that causes very slow inserts (the slow query log will also show that); a prefix index, sketched below, can avoid that.
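For example, a prefix index (the index name, columns, and prefix length here are assumptions about your schema):
ALTER TABLE product_search_query
  ADD INDEX idx_psq_querytext (QueryText(32), CategoryId);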
And yes, the problem may be elsewhere, but you must start somewhere, mustn't you?
DC
