I've rarely used proc freq before. I'm trying to run the following and I receive an error saying SAS is unable to allocate sufficient memory. The dataset has about 15,000 records. What are my alternatives here?
proc freq data=dsb_un noprint;
table bsn*dsb / out=dsb_un2(where=(count>1) drop=percent);
run;
Since you're dropping percent, the following should produce identical output (note that the BY statement requires the input to be sorted by bsn):
proc freq data=dsb_un noprint;
by bsn;
tables dsb/out=dsb_un2(where=(count>1) drop=percent);
run;
The BY statement should decrease the memory allocation significantly. You also could use PROC SQL in a similar way that would probably fit fine in memory.
The problem is likely that DSB and BSN are mostly unique values each, so you probably have something like 10k+ values for each - making a master table of 10k*10k or 1e8 cells, requiring 8e8 bytes of memory, which may be beyond your available memory for SAS.
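The arithmetic behind that estimate, as a quick check (the 10k distinct values per variable is an assumption, not something known from the question):

```python
# Rough memory estimate for the crosstab PROC FREQ tries to build.
levels_bsn = 10_000
levels_dsb = 10_000
cells = levels_bsn * levels_dsb      # crosstab cells PROC FREQ must hold
bytes_needed = cells * 8             # assuming roughly 8 bytes per cell
print(cells)         # 100000000
print(bytes_needed)  # 800000000
```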
I've hit this before as well. The way I got around it was simply not to use proc freq. I believe I used proc summary instead. It can count frequencies just as well.
First a test dataset:
data tmp;
set sashelp.class;
dummy = 1;
run;
Using your original freq approach:
proc freq data=tmp noprint;
table sex*age / out=freq1(where=(count>1) drop=percent);
run;
Using a proc summary approach:
proc summary data=tmp noprint nway missing;
class sex age;
var dummy;
output out=freq2(where=(dummy>1) drop=_type_ _freq_) sum=;
run;
Note that proc summary may need a dummy variable that you can calculate against. Hence the creation of the dummy=1 flag in my test dataset.
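For readers outside SAS, the same duplicate-count logic can be sketched in pandas (illustrative only; the bsn/dsb column names are taken from the question, the data is made up):

```python
import pandas as pd

# Toy data standing in for dsb_un.
df = pd.DataFrame({"bsn": [1, 1, 2, 2, 3],
                   "dsb": ["a", "a", "b", "c", "d"]})

# Count each bsn*dsb combination, then keep only the duplicates,
# mirroring out=...(where=(count>1) drop=percent).
out = (df.groupby(["bsn", "dsb"]).size()
         .reset_index(name="count"))
out = out[out["count"] > 1]
print(out)
```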
I'm trying to use read_sql_table from Dask, but I'm facing some issues related to the index_col parameter. My SQL table doesn't have any numeric column, and I don't know what to pass as index_col.
I read in the documentation that if index_col is of type "object" I have to provide the divisions parameter, but I don't know what the values in my index_col are before reading the table.
I'm really confused. I don't know why I have to give an index_col when using read_sql_table but not when using read_csv.
I've found in certain situations it's easiest to handle this by scattering DataFrame objects out to the cluster by way of pd.read_sql and its chunksize argument:
import pandas as pd
from dask import bag as db
from dask.distributed import Client

client = Client(...)      # assumes a running Dask distributed cluster
sql_text = "SELECT ..."
sql_meta = {"column0": "object", "column1": "uint8"}
sql_conn = connect(...)   # any DB-API connection
dfs_futs = map(client.scatter,  # Scatter each chunk to the cluster
               pd.read_sql(sql_text,
                           sql_conn,
                           chunksize=10_000,  # Iterate in chunks of 10,000
                           columns=list(sql_meta.keys())))
# Now join our chunks (remotely) into a single frame.
df = db.from_sequence(list(dfs_futs)).to_dataframe(meta=sql_meta)
This is nice since you don't need to handle any potential drivers/packages that would be cumbersome to manage on distributed nodes and/or situations where it's difficult to easily partition your data.
Just a note on performance: for my use case, we leverage our database's external table operations to spool data out to a CSV and then read that with pd.read_csv (it's pretty much the same deal as above). Compared to the way Dask parallelizes and chunks up SELECT ... FROM ... WHERE queries, this can be acceptable performance-wise, since there is a cost to performing the chunking inside the database.
Dask needs a way to read the partitions of your data independently of one another. This means being able to phrase the query for each part with a clause like "WHERE index_col >= val0 AND index_col < val1". If you have nothing numerical, Dask can't guess reasonable values for you. You can still do this if you can determine a way to provide reasonable delimiters, like list(string.ascii_letters). You can also provide your own complete WHERE clauses if you must.
Note that OFFSET/LIMIT does not work for this task, because
the result is not in general guaranteed for any given inputs (this behaviour is database implementation specific)
getting to some particular offset is done by paging through the results of the whole query, so the server has to do many times the amount of work necessary
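To make the partition idea concrete, here is a sketch of how one query per partition might be phrased from string delimiters (the column name index_col and the exact clause format are illustrative, not Dask's actual internals):

```python
import string

# Delimiters covering the range of a non-numeric index column.
divisions = list(string.ascii_letters)   # 'a'..'z' then 'A'..'Z'

# One WHERE clause per adjacent pair of delimiters = one partition.
clauses = [
    f"WHERE index_col >= '{lo}' AND index_col < '{hi}'"
    for lo, hi in zip(divisions[:-1], divisions[1:])
]
print(len(clauses))   # 51 partitions from 52 delimiters
print(clauses[0])     # WHERE index_col >= 'a' AND index_col < 'b'
```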
I have a large txt file (ipaddress.txt) with a lot of lines; each line is an IP address, e.g.:
44.XXX.XXX.XXX
45.XXX.XXX.XXX
46.XXX.XXX.XXX
47.XXX.XXX.XXX
48.XXX.XXX.XXX
I load this list into a TStringList, e.g.:
FIpAddressList := TStringList.Create();
FIpAddressList.Sorted := true;
FIpAddressList.LoadFromFile('ipaddress.txt');
and I want to check if an IP address is in the list, e.g.:
function IsIPinList(const IPAddress : string): Boolean;
begin
Result := (FIpAddressList.IndexOf(IPAddress) <> -1);
end;
It works... but it is slow with a huge TStringList.
Is there any way to make this process faster?
UPDATE
The list is static, with monthly updates, and has roughly 5,000 lines.
The function is invoked on each request to a server (e.g. 5 times per second).
The list is loaded only when the server service start.
One way to make this quicker is to arrange for the list to be sorted so that the search can be performed using binary search, O(log n), rather than linear search, O(n).
If the list is static then the best thing to do is to sort the list outside your program so that you can sort it exactly once.
If the list is more dynamic then you will have to sort it, and keep any modifications ordered. In this scenario, sorting the list will only bring benefits if the number of searches you make is sufficient to overcome the extra cost of sorting and maintaining the order.
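As a sketch of the sorted-list-plus-binary-search idea (in Python rather than Delphi, purely to illustrate; a TStringList with Sorted := True does the same internally):

```python
import bisect

# Sort once at load time; each search is then O(log n).
ips = sorted(["44.1.2.3", "45.6.7.8", "46.9.10.11"])  # illustrative data

def ip_in_sorted_list(ip: str) -> bool:
    # bisect_left finds the insertion point; check whether the
    # element at that point is an exact match.
    i = bisect.bisect_left(ips, ip)
    return i < len(ips) and ips[i] == ip

print(ip_in_sorted_list("45.6.7.8"))  # True
print(ip_in_sorted_list("8.8.8.8"))   # False
```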
Another approach is to use a dictionary containing your items. Typically this will involve hashing each string. Whilst the lookup complexity is O(1), the cost of hashing can be significant.
Yet another way to speed things up is to convert the IP address strings to 32 bit integers. In fact this is sure to yield a huge performance benefit, and you should do this irrespective of any other concerns. It is always faster and clearer to work with a 32 bit integer than a string representation of an IP address.
These are just some possible approaches, there are more. Which to choose depends very much on the usage trade offs.
Whilst you probably are just looking for some code to solve your problem, it is my view that this problem is more algorithmic than code based. You need to better understand the requirements, data size constraints, etc. before selecting an algorithm. Once you have decided on an algorithm, or narrowed the choice down to a small number to compare, implementation should be straightforward.
Another possibility is that you have misdiagnosed your problem. Even linear search over a list of 5000 IP addresses stored as strings should be quicker than you are reporting:
My computer can search such a list 2,000 times a second using linear search.
Once you sort the list and switch to binary search, then it can manage 1,000,000 searches a second.
Switch to storing an ordered array of integers, and it achieves over 10,000,000 searches a second.
A hash-based dictionary of integers gives performance twice as fast again.
So, although the performance of your search could easily be improved by a factor of 20,000, I still don't think that your performance problems are down to what you believe. I wonder if your real problem is that you are reading the file from disk every time you search.
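The integer-conversion and hashing suggestions combined, sketched in Python (the Delphi version would use a similar inet_aton-style conversion plus a TDictionary; the data here is made up):

```python
import socket
import struct

def ip_to_u32(ip: str) -> int:
    # Pack the dotted quad into 4 bytes, then read them back as a
    # big-endian unsigned 32-bit integer.
    return struct.unpack("!I", socket.inet_aton(ip))[0]

# Build the hash set once, when the service starts.
ip_set = {ip_to_u32(line) for line in ["44.1.2.3", "45.6.7.8"]}

def is_ip_in_list(ip: str) -> bool:
    return ip_to_u32(ip) in ip_set   # O(1) average lookup

print(is_ip_in_list("44.1.2.3"))  # True
print(is_ip_in_list("1.2.3.4"))   # False
```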
I want to read the entire contents of a table into memory, as quickly as possible. I am using Nexus database, but there might be some techniques I could use that are applicable to all database types in Delphi.
The table I am looking at has 60,000 records with 20 columns. So not a huge data set.
From my profiling, I have found the following so far:
Accessing tables directly using TnxTable is no faster or slower than using a SQL query and 'SELECT * FROM TableName'
The simple act of looping through the rows, without actually reading or copying any data, takes the majority of the time.
The performance I am getting is
Looping through all records takes 3.5 seconds
Looping through all the records, reading the values and storing them, takes 3.7 seconds (i.e. only 0.2 seconds more)
A sample of my code
var query:TnxQuery;
begin
query.SQL.Text:='SELECT * FROM TableName';
query.Active:=True;
while not query.Eof do
query.Next;
This takes 3.5 seconds on a 60,000 row table.
Does this performance sound reasonable? Are there other approaches I can take that would let me read the data faster?
I am currently reading data from a server on the same computer, but eventually this may be from another server on a LAN.
You should be using BlockRead mode with a TnxTable for optimal read speed:
nxTable.BlockReadOptions := [gboBlobs, gboBookmarks];
//leave out gboBlobs if you want to access blobs only as needed
//leave out gboBookmarks if no bookmark support is required
nxTable.BlockReadSize := 1024*1024; //1MB
// setting block read size performs an implicit First
// while block read mode is active only calls to Next and First are allowed for navigation
try
while not nxTable.Eof do begin
// do something....
nxTable.Next;
end;
finally
nxTable.BlockReadSize := 0;
end;
Also, if you don't need to set a range on a specific index, make sure to use the sequential access index for fastest possible access.
I am a Delphi programmer.
In a program I have to generate two-dimensional arrays with different "branch" lengths.
They are very big and the operation takes a few seconds, which is annoying.
For example:
var a: array of array of Word;
i: Integer;
begin
SetLength(a, 5000000);
for i := 0 to 4999999 do
SetLength(a[i], Diff_Values);
end;
I am aware of the SetLength(a, dim1, dim2) form, but it is not applicable here. Neither is setting a minimum value (> 0) for dim2 and continuing from there, because the minimum of dim2 is 0 (some "branches" can be empty).
So, is there a way to make it fast? Not just by 5..10% but really FAST...
Thank you.
When dealing with a large amount of data, there's a lot of work that has to be done, and this places a theoretical minimum on the amount of time it can be done in.
For each of 5 million iterations, you need to:
Determine the size of the "branch" somehow
Allocate a new array of the appropriate size from the memory manager
Zero out all the memory used by the new array (SetLength does this for you automatically)
Step 1 is completely under your control and can possibly be optimized. 2 and 3, though, are about as fast as they're gonna get if you're using a modern version of Delphi. (If you're on an old version, you might benefit from installing FastMM and FastCode, which can speed up these operations.)
The other thing you might do, if appropriate, is lazy initialization. Instead of trying to allocate all 5 million arrays at once, just do the SetLength(a, 5000000); at first. Then when you need to get at a "branch", first check if its length = 0. If so, it hasn't been initialized, so initialize it to the proper length. This doesn't save time overall, in fact it will take slightly longer in total, but it does spread out the initialization time so the user doesn't notice.
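A minimal sketch of that lazy-initialization pattern (a Python stand-in for the Delphi arrays; the sizes are illustrative):

```python
ROWS = 1_000               # stand-in for the 5,000,000 outer elements
branches = [None] * ROWS   # allocate only the outer array up front

def get_branch(i: int, length: int) -> list:
    # Initialize a branch the first time it is touched; None plays the
    # role of "length = 0, not yet initialized" in the Delphi version.
    if branches[i] is None:
        branches[i] = [0] * length
    return branches[i]

b = get_branch(3, 7)
print(len(b))  # 7; the other 999 branches are still unallocated
```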
If your initialization is already as fast as it will get, and your situation is such that lazy initialization can't be used here, then you're basically out of luck. That's the price of dealing with large amounts of data.
I just tested your exact code, with a constant for Diff_Values, timed it using GetTickCount() for rudimentary timing. If Diff_Values is 186 it takes 1466 milliseconds, if Diff_Values is 187 it fails with Out of Memory. You know, Out of Memory means Out of Address Space, not really Out of Memory.
In my opinion you're allocating so much data you run out of RAM and Windows starts paging, that's why it's slow. On my system I've got enough RAM for the process to allocate as much as it wants; And it does, until it fails.
Possible solutions
The obvious one: Don't allocate that much!
Figure out a way to allocate all data into one contiguous block of memory: this helps with address space fragmentation. It is similar to how a two-dimensional array with fixed-size "branches" is allocated, but if your "branches" have different sizes, you'll need to figure out a different mathematical formula, based on your data.
Look into other data structures, possibly ones that cache to disk (to break the 2 GB address space limit).
In addition to Mason's points, here are some more ideas to consider:
If the branch lengths never change after they are allocated, and you have an upper bound on the total number of items that will be stored in the array across all branches, then you might be able to save some time by allocating one huge chunk of memory and divvying up the "branches" within that chunk yourself. Your array would become a 1 dimensional array of pointers, and each entry in that array points to the start of the data for that branch. You keep track of the "end" of the used space in your big block with a single pointer variable, and when you need to reserve space for a new "branch" you take the current "end" pointer value as the start of the new branch and increment the "end" pointer by the amount of space that branch requires. Don't forget to round up to dword boundaries to avoid misalignment penalties.
This technique will require more use of pointers, but it offers the potential of eliminating all the heap allocation overhead, or at least replacing the general purpose heap allocation with a purpose-built very simple, very fast suballocator that matches your specific use pattern. It should be faster to execute, but it will require more time to write and test.
This technique will also avoid heap fragmentation and reduces the releasing of all the memory to a single deallocation (instead of millions of separate deallocations in your present model).
Another tip to consider: If the first thing you always do with each newly allocated array "branch" is assign data into every slot, then you can eliminate step 3 in Mason's example - you don't need to zero out the memory if all you're going to do is immediately assign real data into it. This will cut your memory write operations in half.
Assuming you can fit the entire data structure into a contiguous block of memory, you can do the allocation in one shot and then take over the indexing.
Note: Even if you can't fit the data into a single contiguous block of memory, you can still use this technique by allocating multiple large blocks and then piecing them together.
First off form a helper array, colIndex, which is to contain the index of the first column of each row. Set the length of colIndex to RowCount+1. You build this by setting colIndex[0] := 0 and then colIndex[i+1] := colIndex[i] + ColCount[i]. Do this in a for loop which runs up to and including RowCount. So, in the final entry, colIndex[RowCount], you store the total number of elements.
Now set the length of a to be colIndex[RowCount]. This may take a little while, but it will be quicker than what you were doing before.
Now you need to write a couple of indexers. Put them in a class or a record.
The getter looks like this:
function GetItem(row, col: Integer): Word;
begin
Result := a[colIndex[row]+col];
end;
The setter is obvious. You can inline these access methods for increased performance. Expose them as an indexed property for convenience to the object's clients.
You'll want to add some code to check for validity of row and col. You need to use colIndex for the latter. You can make this checking optional with {$IFOPT R+} if you want to mimic range checking for native indexing.
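The colIndex scheme above, sketched in Python (row sizes are illustrative, including an empty row):

```python
# Per-row column counts (the ColCount of the answer), including an empty row.
col_count = [3, 0, 2, 4]

# Build colIndex: col_index[i] is where row i starts in the flat array,
# and col_index[-1] is the total number of elements.
col_index = [0]
for c in col_count:
    col_index.append(col_index[-1] + c)

a = [0] * col_index[-1]   # one contiguous allocation for everything

def get_item(row: int, col: int) -> int:
    assert 0 <= col < col_count[row]   # optional range check
    return a[col_index[row] + col]

def set_item(row: int, col: int, value: int) -> None:
    assert 0 <= col < col_count[row]
    a[col_index[row] + col] = value

set_item(3, 2, 42)
print(col_index[-1])   # 9 elements in total
print(get_item(3, 2))  # 42
```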
Of course, this is a total non-starter if you want to change any of your column counts after the initial instantiation!
I am looking the fastest way to insert many records at once (+1000) to an table using ADO.
Options:
using insert commands and parameters
ADODataSet1.CommandText:='INSERT INTO .....';
ADODataSet1.Parameters.CreateParameter('myparam',ftString,pdInput,12,'');
ADODataSet1.Open;
using TAdoTable
AdoTable1.Insert;
AdoTable1.FieldByName('myfield').Value:=myvalue;
//..
//..
//..
AdoTable1.FieldByName('myfieldN').value:=myvalueN;
AdoTable1.Post;
I am using delphi 7, ADO, and ORACLE.
Probably your fastest way would be option 2. Insert all the records and tell the dataset to send it off to the DB. But FieldByName is slow, and you probably shouldn't use it in a big loop like this. If you already have the fields (because they're defined at design time), reference the fields in code by their actual names. If not, call FieldByName once for each field, store the results in local variables, and reference the fields by these when you're inserting.
Using ADO I think you may be out of luck. Not all back-ends support bulk insert operations and so ADO implements an abstraction to allow consistent coding of apparent bulk operations (batches) irrespective of the back-end support which "under the hood" is merely inserting the "batch" as a huge bunch of parameterised, individual inserts.
The downside of this is that even those back-ends which do support bulk inserts do not always code this into their ADO/OLEDB provider(s) - why bother? (I've seen it mentioned that the Oracle OLEDB provider supports bulk operations and that it is ADO which denies access to it, so it's even possible that the ADO framework simply does not allow a provider to support this functionality more directly in ADO itself - I'm not sure).
But, you mention Oracle, and this back-end definitely does support bulk insert operations via its native APIs.
There is a commercial Delphi component library - ODAC (Oracle Direct Access Components) for, um, direct access to Oracle (it does not even require the Oracle client software to be installed).
This also directly supports the bulk insert capabilities provided by Oracle and is additionally a highly efficient means for accessing your Oracle data stores.
What you are trying to do is called bulk insert. Oracle provides .NET assembly Oracle.DataAccess.dll that you can use for this purpose. There is no hand-made solution that you can think of that would beat the performance of this custom vendor library for the Oracle DBMS.
http://download.oracle.com/docs/html/E10927_01/OracleBulkCopyClass.htm#CHDGJBBJ
http://dotnetslackers.com/articles/ado_net/BulkOperationsUsingOracleDataProviderForNETODPNET.aspx
The most common idea is to use arrays of values for each column and apply them to a template SQL. In the example below employeeIds, firstNames, lastNames and dobs are arrays of the same length with the values to insert.
The Array Binding feature in ODP.NET
allows you to insert multiple records
in one database call. To use Array
Binding, you simply set
OracleCommand.ArrayBindCount to the
number of records to be inserted, and
pass arrays of values as parameters
instead of single values:
string sql =
    "insert into bulk_test (employee_id, first_name, last_name, dob) "
    + "values (:employee_id, :first_name, :last_name, :dob)";

OracleConnection cnn = new OracleConnection(connectString);
cnn.Open();
OracleCommand cmd = cnn.CreateCommand();
cmd.CommandText = sql;
cmd.CommandType = CommandType.Text;
cmd.BindByName = true;

// To use ArrayBinding, we need to set ArrayBindCount
cmd.ArrayBindCount = numRecords;

// Instead of single values, we pass arrays of values as parameters
cmd.Parameters.Add(":employee_id", OracleDbType.Int32,
                   employeeIds, ParameterDirection.Input);
cmd.Parameters.Add(":first_name", OracleDbType.Varchar2,
                   firstNames, ParameterDirection.Input);
cmd.Parameters.Add(":last_name", OracleDbType.Varchar2,
                   lastNames, ParameterDirection.Input);
cmd.Parameters.Add(":dob", OracleDbType.Date,
                   dobs, ParameterDirection.Input);
cmd.ExecuteNonQuery();
cnn.Close();
As you can see, the code does not look that much different from doing a regular single-record insert. However, the performance improvement is quite drastic, depending on the number of records involved. The more records you have to insert, the bigger the performance gain. On my development PC, inserting 1,000 records using Array Binding is 90 times faster than inserting the records one at a time. Yes, you read that right: 90 times faster! Your results will vary, depending on the record size and network speed/bandwidth to the database server.
A bit of investigative work reveals that the SQL is considered to be "executed" multiple times on the server side. The evidence comes from V$SQL (look at the EXECUTIONS column). However, from the .NET point of view, everything was done in one call.
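The same batch idea in Python, using sqlite3's executemany purely as a runnable stand-in (ODP.NET array binding is the Oracle analogue; the table and column names here are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bulk_test (employee_id INTEGER, first_name TEXT)")

# One call, many rows: the driver iterates the parameter sets for us,
# instead of us issuing 1,000 separate INSERT statements.
rows = [(i, f"name{i}") for i in range(1000)]
conn.executemany("INSERT INTO bulk_test VALUES (?, ?)", rows)

print(conn.execute("SELECT COUNT(*) FROM bulk_test").fetchone()[0])  # 1000
```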
You can really improve the insert performance by using the TADOConnection object directly.
dbConn := TADOConnection......
dbConn.BeginTrans;
try
  dbConn.Execute(command, cmdText, [eoExecuteNoRecords]);
  dbConn.CommitTrans;
except
  on E: Exception do
  begin
    dbConn.RollbackTrans;
    raise;
  end;
end;
Also, the speed can be improved further by inserting more than one record at a time.
You could also try the BatchOptmistic mode of the TADODataset. I don't have Oracle so no idea whether it is supported for Oracle, but I have used similar for MS SQL Server.
ADODataSet1.CommandText:='select * from .....';
ADODataSet1.LockType:=ltBatchOptimistic;
ADODataSet1.Open;
ADODataSet1.Insert;
ADODataSet1.FieldByName('myfield').Value:=myvalue1;
//..
ADODataSet1.FieldByName('myfieldN').value:=myvalueN1;
ADODataSet1.Post;
ADODataSet1.Insert;
ADODataSet1.FieldByName('myfield').Value:=myvalue2;
//..
ADODataSet1.FieldByName('myfieldN').value:=myvalueN2;
ADODataSet1.Post;
ADODataSet1.Insert;
ADODataSet1.FieldByName('myfield').Value:=myvalue3;
//..
ADODataSet1.FieldByName('myfieldN').value:=myvalueN3;
ADODataSet1.Post;
// Finally update Oracle with entire dataset in one batch
ADODataSet1.UpdateBatch(arAll);
1000 rows is probably not the point where this approach becomes economical, but consider writing the inserts to a flat file and then running the SQL*Loader command-line utility. That is seriously the fastest way to bulk-load data into Oracle.
http://www.oracleutilities.com/OSUtil/sqlldr.html
I've seen developers spend literally weeks writing (Delphi) loading routines that performed several orders of magnitude slower than SQL*Loader controlled by a control file that took around an hour to write.
Remember to disable any controls that are linked to the Dataset/Table/Query/...
...
ADOTable.DisableControls;
try
  ...
finally
  ADOTable.EnableControls;
end;
...
You might try Append instead of Insert:
AdoTable1.Append;
AdoTable1.FieldByName('myfield').Value:=myvalue;
//..
//..
//..
AdoTable1.FieldByName('myfieldN').value:=myvalueN;
AdoTable1.Post;
With Append, you will save some effort on the client dataset, as the records will get added to the end rather than inserting records and pushing the remainder down.
Also, you can unhook any data aware controls that might be bound to the dataset or lock them using BeginUpdate.
I get pretty decent performance out of the append method, but if you're expecting bulk speeds, you might want to look at inserting multiple rows in a single query by executing the query itself, like this:
AdoQuery1.SQL.Text := 'INSERT INTO myTable (myField1, myField2) VALUES (1, 2), (3, 4)';
AdoQuery1.ExecSQL;
You should get some benefits from the database engine when inserting multiple records at once. Note, though, that this multi-row VALUES syntax is not supported by Oracle; there you would use a multitable insert (INSERT ALL ... INTO ... SELECT * FROM dual) instead.