I want to know if there is any other way to load data from a text file other than using external tables.
The text file looks something like this:
101 fname1 lname1 D01..
102 fname2 lname2 D02..
I want to load it into a table with columns emp_id, fname, lname, dept etc.
Thanks!
There are three utilities in Informix to load data to the database from flat files:
The LOAD SQL command. Very simple to use, but not very flexible. I would recommend this for small numbers of records (fewer than 10,000).
dbload, a command-line utility that is a bit more complex than the LOAD SQL command. It gives you more control over how the records are loaded: commit intervals, starting point in the flat file, number of errors before exiting, etc. I'd recommend this utility for small to medium-sized data loads (between 10,000 and 100,000 records).
HPL, or High Performance Loader, a rather complex utility that can load data at a very high rate, but with a lot of overhead. Recommended for large to very large data loads.
As ceinmart suggested in the comments, you can do it from the server side or from the client side. From the server side you can use DB-Access and the LOAD command. From the client side you can use any tool you like. For such tasks I often use Jython, which can use Python's string and CSV libraries as well as JDBC database drivers. With Jython you can use the csv module to read data from the file and a PreparedStatement to insert it into the database. My other answer, "Substring in Informix", shows such a PreparedStatement.
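As a rough illustration of the client-side approach, here is a minimal sketch in plain Python. It assumes the file is space-separated, and it uses the standard-library sqlite3 module purely as a stand-in for an Informix connection; the table name `emp` and the helper `load_rows` are made up for the example. With Jython and JDBC the same idea would use a PreparedStatement instead of `executemany`.

```python
import csv
import io
import sqlite3


def load_rows(conn, lines):
    """Insert space-separated records into emp and return the number of rows."""
    reader = csv.reader(lines, delimiter=" ", skipinitialspace=True)
    # Keep the first four fields of every non-empty record.
    rows = [tuple(rec[:4]) for rec in reader if rec]
    # Parameterized insert: the driver handles quoting of the values.
    conn.executemany(
        "INSERT INTO emp (emp_id, fname, lname, dept) VALUES (?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)


# sqlite3 is only a stand-in database so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (emp_id INTEGER, fname TEXT, lname TEXT, dept TEXT)")
sample = io.StringIO("101 fname1 lname1 D01\n102 fname2 lname2 D02\n")
loaded = load_rows(conn, sample)
```

For a real file, pass an open file object instead of the `io.StringIO` sample.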
I have input data stored as a single large file on S3.
I want Dask to chop the file automatically, distribute it to workers and manage the data flow. Hence the idea of using a distributed collection, e.g. a bag.
On each worker I have command-line tools (Java) that read the data from files. Therefore I'd like to write a whole chunk of data into a file, call the external CLI/code to process the data and then read the results from an output file. This looks like processing batches of data instead of one record at a time.
What would be the best approach to this problem? Is it possible to write a partition to disk on a worker and process it as a whole?
PS. It is not necessary, but desirable, to stay in a distributed collection model, because other operations on the data might be simpler Python functions that process data record by record.
You probably want the read_bytes function. This breaks the file into many chunks cleanly split by a delimiter (like an endline). It gives you back a list of dask.delayed objects that point to those blocks of bytes.
There is more information on this documentation page: http://dask.pydata.org/en/latest/bytes.html
Here is an example from the docstring:
>>> sample, blocks = read_bytes('s3://bucket/2015-*-*.csv', delimiter=b'\n')
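For the "write a partition to disk and process it as a whole" part, one possible shape is a plain function that takes all records of a partition, writes them to a temporary file, runs the external tool and reads the output back. The sketch below uses only the standard library. The `CLI` command is a hypothetical stand-in for the Java tool (it just upper-cases the input file), and `process_partition` is an invented name. A function like this could then be applied per partition, for example with a bag's `map_partitions`, or called on each block after decoding it.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical stand-in for the Java CLI: copies infile to outfile, upper-cased.
CLI = [
    sys.executable,
    "-c",
    "import sys; open(sys.argv[2], 'w').write(open(sys.argv[1]).read().upper())",
]


def process_partition(records, cli=CLI):
    """Write one partition to disk, run the external tool, return its output lines."""
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "in.txt")
        dst = os.path.join(tmp, "out.txt")
        with open(src, "w") as f:
            f.writelines(line + "\n" for line in records)
        # check=True raises if the external tool exits with a non-zero status.
        subprocess.run(cli + [src, dst], check=True)
        with open(dst) as f:
            return [line.rstrip("\n") for line in f]


result = process_partition(["a,1", "b,2"])
```

The temporary directory is removed when the function returns, so nothing accumulates on the worker's disk.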
I am developing an iOS app which stores its data in an sqlite3 database. Every insert, update or delete operation is logged locally and then pushed up to iCloud, and other devices running the app can download these transaction logs and execute the SQL commands within them to keep all devices running the app synchronised. This is working extremely well.
I am now looking into optimising the process and it occurs to me that logging the whole SQL command results in a lot of redundant data being pushed to and pulled from the cloud, which will ultimately result in longer sync times and increased data usage.
The SQL queries are very predictable (there is only one format each of insert, update and delete used in the app) so I am considering using an encoding/decoding routine which will compress the SQL command for storage in the transaction log, and then decompress it from the log for execution.
The string compression methods I have found don't seem to do too well with SQL queries, so I've devised my own:
Single byte to identify the SQL command type
Table and column names indexed in arrays in the app, and the names are encoded using their index position in the array
String of tab separated digits to represent groups of columns, and tab separated values (e.g. in a VALUES() clause)
Encoded check column and value (for the WHERE clause in an update or delete command)
Using this format I have compressed one example query of 186 bytes down to just 78 bytes. This has clear advantages for speed of data transmission and amount of data usage.
The disadvantage I foresee is that it will require more processing on the client end to encode and decode the commands. I am wondering whether anybody has done anything similar and has any advice to offer.
To make it clearer what I am asking: in general, is it better to minimise the amount of data being synced and increase the burden on the client to interpret that data, or is it preferable to just sync the data as-is and leave the client to use it as-is?
I'm posting an answer to my own question because I have some information which may provide advice to others who are looking in to the same thing.
I spent some time yesterday writing SQL query compression and decompression functions in Objective C. These functions use the method detailed in my original question and in so doing reduce all non-data parts of the queries (SQL command [insert/update/delete], table names and column names) to a single digit to represent each, and remove all of the remaining SQL syntax (key words like "FROM", spaces, commas, brackets...).
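To give an idea of the scheme, here is a small sketch in Python rather than Objective C. It is not my actual code: the table and column lists, the `I` command byte and the function names are invented for the illustration, and it only handles inserts. It also assumes values never contain tabs or newlines.

```python
# Hypothetical schema: names are replaced by their index in these lists.
TABLES = ["contacts", "notes"]
COLUMNS = ["id", "name", "email", "body"]
INSERT = "I"  # single character identifying the command type


def encode_insert(table, values):
    """Encode an insert; values maps column name -> string value."""
    cols = "\t".join(str(COLUMNS.index(c)) for c in values)
    vals = "\t".join(values.values())
    return f"{INSERT}{TABLES.index(table)}\n{cols}\n{vals}"


def decode_insert(blob):
    """Rebuild a parameterized INSERT statement and its values."""
    head, cols, vals = blob.split("\n")
    table = TABLES[int(head[1:])]
    names = [COLUMNS[int(i)] for i in cols.split("\t")]
    placeholders = ", ".join("?" for _ in names)
    sql = f"INSERT INTO {table} ({', '.join(names)}) VALUES ({placeholders})"
    return sql, vals.split("\t")


encoded = encode_insert("contacts", {"id": "7", "name": "Ann"})
sql, params = decode_insert(encoded)
```

Decoding to a parameterized statement instead of a literal SQL string also avoids having to re-escape the values on the receiving device.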
I have done some testing by creating five records both with and without the query compression enabled, and here are the results for the file sizes:
Full SQL queries (logged unmodified to the transactions file)
Zip compressed: 747 bytes
Uncompressed: 1,515 bytes
Compressed SQL queries (compressed using my custom format to the transactions file)
Zip compressed: 673 bytes
Uncompressed: 785 bytes
As you can see, the greatest benefit came from using both types of compression. Encoding the queries achieved ~50% compression. Encoding and then zipping the queries achieved about 10% compression compared to zipping the unencoded queries.
The question I really need to ask myself now is whether it's worth the additional overhead to encode and then zip the queries, and to unzip and decode them after downloading the transaction logs from other devices. In this example I only saved 74 bytes overall by encoding and then compressing the transaction log. With five queries this averages 14.8 bytes saved per query (compared to zipping alone). This is just 14,800 bytes (about 14.5 KB) per 1,000 records, which does not seem a lot.
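For anyone who wants to check the arithmetic, the figures above work out as follows (the variable names are just for this calculation):

```python
full_raw, full_zip = 1515, 747   # full SQL: uncompressed, zipped (bytes)
enc_raw, enc_zip = 785, 673      # encoded SQL: uncompressed, zipped (bytes)
queries = 5

encoding_saving = 1 - enc_raw / full_raw      # encoding alone
extra_zip_saving = 1 - enc_zip / full_zip     # encode + zip vs. zip alone
bytes_saved = full_zip - enc_zip              # absolute saving after zipping
saved_per_query = bytes_saved / queries
saved_per_1000 = saved_per_query * 1000

print(f"{encoding_saving:.1%} {extra_zip_saving:.1%} "
      f"{bytes_saved} {saved_per_query} {saved_per_1000:.0f}")
```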
I'm trying to write a Rails action to stream data where the resulting CSV / XML / JSON file is much larger than the memory limit for the web server. The tricky part is that each item in the dataset is composed from two sources. One is a Postgres DB, where I plan to open a CURSOR (or just use id > Y LIMIT X) to batch process the data. The other is a custom data store, but there is basically a cursor object I can use to batch that as well.
My problem is I'm not sure what the best way to iterate over the second data source is. I imagine I'll need a structure to open the cursor and as I consume the data in each batch I'll load the next batch.
This problem seems like it might have been solved already so I'm hoping there's an established pattern I can use.
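The structure I have in mind looks roughly like the sketch below, written in Python only to show the shape (a lazy enumerator would play the same role in Ruby). Everything here is an assumption: `fetch_db` and `fetch_store` are placeholders for the two batched sources, and the sketch assumes both return items in the same order so they can be paired one to one.

```python
def batched(fetch_batch, size):
    """Yield items one at a time, fetching the next batch only when needed."""
    offset = 0
    while True:
        batch = fetch_batch(offset, size)
        if not batch:
            return
        yield from batch
        offset += len(batch)


def stream_csv(primary, secondary):
    """Pair items from both sources lazily and yield one CSV line per item."""
    for row, extra in zip(primary, secondary):
        yield f"{row},{extra}\n"


# Placeholder sources: in-memory lists sliced like LIMIT/OFFSET queries.
db_rows = [0, 1, 2, 3, 4]
store_rows = ["a", "b", "c", "d", "e"]


def fetch_db(offset, size):
    return db_rows[offset:offset + size]


def fetch_store(offset, size):
    return store_rows[offset:offset + size]


body = "".join(stream_csv(batched(fetch_db, 2), batched(fetch_store, 3)))
```

Only one batch per source is held in memory at a time, and the two sources can use different batch sizes.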
I had a pre-interview task, which I completed with a working solution. However, I was marked down and did not get an interview because I had used a TADODataset. I imported a CSV file to populate the dataset. The data had to be processed in a specific way, so I used the dataset's filtering and sorting to order the data the way I wanted, and then I did the logic processing in a while loop. The feedback I received said that this was bad because it would be very slow for large files.
My main question is: if using an in-memory dataset is slow for processing large files, what would have been a better way to access the information from the CSV file? Should I have used string lists or something like that?
It really depends on how "big" the file is and on the resources (in this case RAM) available for the task.
"The feedback that was received said that this was bad as it would be very slow for large files."
CSV files are usually used for moving data around without worrying too much (if at all) about import/export, since the format is extremely simple. In most cases I've encountered, the files are between ~1MB and ~10MB, but that's not to say that others would not dump more data in CSV format.
Suppose you have an 80MB CSV file. That's a file you want to process in chunks; otherwise (depending on your processing) you can eat hundreds of MB of RAM. In this case what I would do is:
while dataToProcess do begin
  // step 1: read up to <X> lines from the file, where <X> is the maximum
  // number of lines you read in one go; if fewer lines remain (e.g. you're
  // down to 50 lines and X is 100), read just those
  // step 2: process the information
  // step 3: generate output, database inserts, etc.
end;
In the above case, you're not loading 80MB of data into RAM, but only a few hundred KB, and the rest you can use for processing, e.g. linked lists, dynamic insert queries (batch inserts), etc.
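The same loop, as a runnable sketch in Python rather than Delphi (the helper name `read_chunks` and the sample data are made up for the example):

```python
import io
from itertools import islice


def read_chunks(lines, max_lines):
    """Yield lists of at most max_lines lines; the last chunk may be shorter."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, max_lines))   # step 1: read up to max_lines
        if not chunk:
            return
        yield chunk                           # steps 2 and 3 happen in the caller


# 250 lines read 100 at a time give chunks of 100, 100 and 50 lines.
sample = io.StringIO("a\n" * 250)
sizes = [len(chunk) for chunk in read_chunks(sample, 100)]
```

Only one chunk is held in memory at a time, regardless of the size of the file.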
"...however I was marked down and did not get an interview due to having used a TADODataset."
I'm not surprised. They were probably looking to see whether you're capable of creating algorithms and providing simple solutions on the spot, without using "ready-made" solutions.
They were probably hoping to see you use dynamic arrays and write one (or more) sorting algorithms.
"Should I have used String Lists or something like that?"
The response might have been the same, again, I think they wanted to see how you "work".
The interviewer was quite right.
The correct, scalable and fastest solution for medium-sized files and larger is to use an 'external sort'.
An 'external sort' is a two-stage process. The first stage splits the file into manageable smaller files, each of them sorted. The second stage merges these files back into a single sorted file, which can then be processed line by line.
It is extremely efficient on any CSV file with over, say, 200,000 lines. The amount of memory the process runs in can be controlled, which eliminates the danger of running out of memory.
I have implemented many such sort processes and in Delphi would recommend a combination of TStringList, TList and TQueue classes.
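To make the two stages concrete, here is a minimal sketch in Python (not Delphi), using temporary files for the sorted runs and the standard library's `heapq.merge` for the merge stage. The function name `external_sort` and the `run_size` parameter are invented for the example, and every input line is assumed to end with a newline.

```python
import heapq
import tempfile
from itertools import islice


def external_sort(lines, run_size, key=None):
    """Yield lines in sorted order, sorting at most run_size lines in memory at once."""
    run_files = []
    it = iter(lines)
    try:
        # Stage 1: split the input into sorted runs, each written to a temp file.
        while True:
            run = sorted(islice(it, run_size), key=key)
            if not run:
                break
            f = tempfile.TemporaryFile("w+")
            f.writelines(run)
            f.seek(0)
            run_files.append(f)
        # Stage 2: merge the runs, reading each file line by line.
        yield from heapq.merge(*run_files, key=key)
    finally:
        for f in run_files:
            f.close()


ordered = list(external_sort(["c\n", "a\n", "d\n", "b\n", "e\n"], run_size=2))
```

For a CSV file, `key` could extract the sort column from each line, and `run_size` controls how much memory stage one uses.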
Good Luck
Currently, I'm using LIBXML::SAXParser::Callbacks to parse a large XML file containing data for 140,000 products. I'm using a rake task to import the data for these products into my Rails app.
My last import took just under 10 hours to complete:
rake asi:import_products --trace 26815.23s user 1393.03s system 80% cpu 9:47:34.09 total
The problem with the current implementation is that the complex dependency structure in the XML means I need to keep track of the entire product node to know how to parse it properly.
Ideally, I'd like a way to process each product node by itself and still be able to use XPath. The file size prevents us from using any method that requires loading the entire XML file into memory. I cannot control the format or size of the original XML. I have at most 3GB of memory I can use for the process.
Is there a better way than this?
Current Rake Task code:
Snippet of the XML file:
Can you fetch the whole file first? If so, I'd suggest splitting the XML file into smaller chunks (say, 512MB or so) so that you can parse several chunks at the same time (one per core), since I assume you have a modern CPU. As for the chunks being invalid or malformed XML: just append or prepend the missing XML with simple string manipulation.
You can also try profiling your callback method. It's a big chunk of code, and I'm pretty sure there is at least one bottleneck whose removal could save you a few minutes.
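If processing one product node at a time with path queries is the goal, an incremental parser that hands you each completed element is another option. The sketch below shows the idea in Python with the standard library's `xml.etree.ElementTree.iterparse`, not Ruby, so treat it as an illustration of the pattern only. The tag names `product` and `name` and the sample document are assumptions, since the real XML is not shown here.

```python
import io
import xml.etree.ElementTree as ET


def iter_products(source, tag="product"):
    """Yield each completed <product> element, then release its contents."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield elem
            # Drop the element's children and text once the caller is done with it.
            elem.clear()


# Made-up sample document standing in for the real file.
xml = (
    b"<catalog>"
    b"<product id='1'><name>A</name><price>2</price></product>"
    b"<product id='2'><name>B</name></product>"
    b"</catalog>"
)
names = [p.findtext("name") for p in iter_products(io.BytesIO(xml))]
```

Inside the loop each element supports ElementTree's limited XPath subset through `find`, `findall` and `findtext`. The cleared, empty elements remain attached to the root, so for a very large file it may also be necessary to remove them from their parent.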