Parse flat file with multiple formats

It's been a while since I've had to write what amounts to a custom-format EDI processor. The last time I wrote one, I was an AS/400 programmer (not iSeries, to give you a timeframe). It was pretty easy: I built a structure, inspected the record type column, and processed each record based on the fixed positions of the data for that record type.
Fast forward to 2012 and I have almost exactly the same requirements, except I no longer have an AS/400 to make it easy.
For brevity: the first 2 columns contain a record type, and the structure of each record depends on that type. Any suggestions on how best to handle this in C# on a web server?
Some options I have considered are FileHelpers and SSIS. I have full control over the environment, so I can do pretty much anything that makes sense.

You can try the multi-record engine of FileHelpers:
http://www.filehelpers.com/example_multirecords.html
You must define one record class for each kind of line you have, and then provide a delegate that lets FileHelpers choose the right class for each line.
There is also a Master Detail engine:
http://www.filehelpers.com/example_masterdetail.html
Latest version of the library: http://teamcity.codebetter.com/viewLog.html?buildId=51642&tab=artifacts&buildTypeId=bt66
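
To make the idea concrete, here is a minimal, language-neutral sketch in Python of dispatching on the record type in the first two columns. It does not use or imitate the FileHelpers API. The record types ("01", "02"), the column layouts, and all class and function names are hypothetical and only illustrate the technique.

```python
from dataclasses import dataclass


@dataclass
class Header:
    batch_id: str
    created: str


@dataclass
class Detail:
    sku: str
    quantity: int


def parse_header(line):
    # Hypothetical layout: columns 3-10 batch id, columns 11-18 date (YYYYMMDD).
    return Header(batch_id=line[2:10].strip(), created=line[10:18])


def parse_detail(line):
    # Hypothetical layout: columns 3-12 SKU, columns 13-17 zero-padded quantity.
    return Detail(sku=line[2:12].strip(), quantity=int(line[12:17]))


# Record type (first two columns) -> parser for that fixed-width layout.
PARSERS = {"01": parse_header, "02": parse_detail}


def parse_lines(lines):
    """Yield one parsed record per non-empty line, chosen by record type."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line:
            continue
        parser = PARSERS.get(line[:2])
        if parser is None:
            raise ValueError(f"line {number}: unknown record type {line[:2]!r}")
        yield parser(line)
```

Supporting a new record type then means adding one class, one parser function, and one entry in the lookup table. This is roughly the role that the record classes and the selector delegate play in the library approach described above.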

Related

Saving different sets of values of variables with a changing structure

I have several sets of values (factory settings, user settings...) for a structure of variables, and these values are saved in binary files. When I want to apply a certain setting, I just load the specific file containing the desired values, and these values are applied to the variables according to the structure. This works fine as long as the structure of variables doesn't change.
I can't figure out how to do it when I add a variable but need to retain the values of the rest (when the structure in the program changes, I need to change the files so that they contain the new values according to the new structure and at the same time keep the old ones).
I'm using a PLC system that is written in ST language. But I'm looking for some overall approach for solving this issue.
Thank you.

Providing a generic solution that works across different PLC platforms is not an easy task. There are many ways to accomplish this, depending on the system or interface you actually want to use (e.g. PLC source code, OPC, ADS, MODBUS, special functions or add-ins from the vendor), and there are further possibilities such as language features on the PLC. I wrote three solutions to this with C#/ST (with OOP extensions) and ADS/OPC communication: one that parses the source code first in C#, one with automatic generation from the PLC side, and one with an automatic registration system for the parameters, backed by an EntityFramework-compatible database as the ParameterStore. If you don't want to invest too much time in this, you should try the parameter management systems provided by your PLC vendor and live with their restrictions.
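
One platform-independent way to picture the parameter-store idea is to save values by name instead of by position, and to merge a saved set into the current defaults when loading. The Python sketch below illustrates that general technique only. It is not the answerer's implementation, JSON stands in for whatever storage format the PLC side can actually handle, and the parameter names are hypothetical.

```python
import json

# Current structure of the program: parameter name -> default value.
# These names are made up for illustration.
DEFAULTS = {"speed": 100, "ramp_time": 2.5, "direction": "forward"}


def load_settings(saved_text, defaults=DEFAULTS):
    """Merge a saved parameter set into the current structure.

    Parameters added since the file was written keep their defaults;
    parameters that no longer exist in the program are ignored.
    """
    saved = json.loads(saved_text)
    merged = dict(defaults)
    for name, value in saved.items():
        if name in merged:
            merged[name] = value
    return merged


def save_settings(settings):
    """Serialize the parameter set with names, so the layout can change later."""
    return json.dumps(settings, sort_keys=True)
```

Because each value travels with its name, adding a variable to the structure does not invalidate files written before the change.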

Handling Huge data from TAdoQuery in delphi

Programming Language: Delphi 6
SQL Server in The Back end.
Problem:
The application used to hit the DB every time it needed something and ended up hitting it more than 2000 times to fetch certain things, which made the application slow. This happened for a lot of tables, each with a different structure and a different number of columns. So I’m trying to reduce the number of calls.
We can have about 4000 records at a time from each table.
Proposed Solution:
Let’s get all the data from DB at once and use it when we need it so we don’t have to keep hitting the DB.
How Solution is turning out so far:
This version of Delphi doesn’t have a dictionary. So we already have an implementation of a dictionary from String List (Let us assume that implementation is good).
Solution 1:
Store this in a dictionary that we created with:
A unique field as a key.
And add the rest of the data as strings in a String List, separated like this:
FieldName1:FieldValue1,FieldName2:FieldValue2,…..
Might have to create about 2000 String Lists to map data to keys.
I took a look at the following link:
How Should I Implement a Huge but Simple Indexed StringList in Delphi?
It looks like they were able to move to a different DB, which is not possible for me.
Is this a sane solution?
Solution 2:
Store this in a dictionary with List.
That list will have Delphi Records.
Records can’t be added to a list directly, so I took a look at this link:
Delphi TList of records
Solution 3:
Or, given that I’m using TAdoQuery, should I use Seek or Locate to find my records?
Please advise on the best way to do this.
Requirements:
Need Random Access of this data.
Insertion of data will happen only once per table, when we fetch all the data as we require it.
Need to only read the data; don’t have to modify it.
Constantly need to search in terms of primary key.
In addition to changing the application we have already done good indexing on DB to take care of things from DB side. This is more to make things run well from the application.

This sounds like a perfect use case for TClientDataSet. It's an in-memory dataset that can be indexed, filtered, and searched easily, can hold any information you can retrieve from the database using a SQL statement, and has pretty good performance over a few thousand reasonably sized rows of data. (The link above is to the current documentation, as I don't have one available for the Delphi 6 docs. They should be very similar, although I don't recall which specific version added the ability to directly include MidasLib in your uses clause to eliminate distributing Midas.dll with your app.)
Cary Jensen wrote a series of articles about it a few years back that you might find useful. The first one can be found in A ClientDataset in Every Database Application - the others in the series are linked from it.
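
The underlying pattern (one query per table, then keyed lookups in memory) can be sketched independently of Delphi. The Python example below is illustrative only: it is not TClientDataSet, an SQLite connection stands in for the SQL Server connection, and the function name is an assumption.

```python
import sqlite3


def load_table(conn, table, key_column):
    """Fetch every row of `table` in one query and index the rows by `key_column`.

    `table` is interpolated into the SQL text, so it must come from trusted
    code, never from user input.
    """
    cursor = conn.execute(f"SELECT * FROM {table}")
    columns = [description[0] for description in cursor.description]
    cache = {}
    for row in cursor:
        record = dict(zip(columns, row))
        cache[record[key_column]] = record
    return cache
```

After the single round trip, each primary-key lookup is a dictionary access such as `cache[some_id]`, which matches the read-only, random-access requirements listed in the question.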

Is there a better solution than ActiveRecord for batch data imports?

I've developed a web interface for a legacy (vendor) database using Ruby on Rails. The database schema is a complete mess, > 450 tables, and customer data spread over more than 20, involving complex joins, etc.
I've got a good solution for this for the web app, it works very well. But we also do nightly imports from external data sources (currently a view to a SQL Server DB and a SOAP feed) and they run SLOW. About 1.5-2.5 hours for the XML data import and about 4 hours for the DB import.
This is after doing some basic optimizations, which include manually starting the MRI garbage collector. And that right there suggests to me I'm Doing It Wrong. I've considered moving the nightly update/insert tasks out of the main Rails app and trying to use either JRuby or Rubinius to take advantage of the better concurrency and garbage collection.
My question is this: I know ActiveRecord isn't really designed for this type of task. But out of the O/RM options for Ruby (my preferred language), it seems to have the best Oracle support.
What would you do? Stick with AR and use a different interpreter? Would that really help? What about DataMapper or Sequel? Is there a better way of doing this?
I'm open to using Scala or Clojure if there's a better alternative (not limited to, but these are the other languages I'm playing with right now)... but what I don't want is something like DBI where I'm writing straight SQL, if for no other reason than that vendor updates occasionally change the DB schema, and I'd rather change a couple of classes than hundreds of UPDATE or INSERT statements.
Hopefully this question isn't 'too vague,' but I could really use some advice about this issue.
FWIW, Ruby is 1.9.2, Rails is 3.0.7, platform is OS X Server Snow Leopard (or optionally Debian 6.0).

Edit: OK, I just realized that this solution will not work for Oracle, sorry.
You should really check out ActiveRecord-Import. It is easy to use and handles bulk imports with a minimal number of SQL statements. I saw a speed-up from 5 hours to 2 minutes. And it will still run validations on the data.
From the GitHub page:
books = []
10.times do |i|
  books << Book.new(:name => "book #{i}")
end
Book.import books
https://github.com/zdennis/activerecord-import

From my experience, ORMs are a great tool to use on the front end, where you're mostly just reading the data or updating a single row at a time. On the back end, where you're ingesting lots of data at a time, they can cause problems because of the way they tend to interact with the database.
As an example, assume you have a Person object that has a long list of Friends (let's say 100 for now). You create the Person object, assign 100 Friends to it, and then save it to the database. It's common for the naive use of an ORM to do 101 writes to the database (one for each Friend, one for the Person). If you were to do this in pure SQL at a lower level, you'd do 2 writes: one for the Person and then one for all the Friends at once (an insert with 100 actual rows). The difference between the two actions is significant.
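
The 101-versus-2 arithmetic can be reproduced with a small Python sketch against an in-memory SQLite database. The table and function names are hypothetical, and SQLite merely stands in for the real database. Both functions return the number of INSERT statements they issue.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("CREATE TABLE friend (person_id INTEGER, name TEXT)")
    return conn


def save_row_by_row(conn, person, friends):
    """What a naive ORM save tends to do: one INSERT per object."""
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (person,))
    statements = 1
    for friend in friends:
        conn.execute(
            "INSERT INTO friend (person_id, name) VALUES (?, ?)",
            (cur.lastrowid, friend),
        )
        statements += 1
    conn.commit()
    return statements


def save_bulk(conn, person, friends):
    """One INSERT for the person, one multi-row INSERT for all friends."""
    cur = conn.execute("INSERT INTO person (name) VALUES (?)", (person,))
    statements = 1
    if friends:
        placeholders = ", ".join(["(?, ?)"] * len(friends))
        params = [value for friend in friends for value in (cur.lastrowid, friend)]
        conn.execute(
            f"INSERT INTO friend (person_id, name) VALUES {placeholders}", params
        )
        statements += 1
    conn.commit()
    return statements
```

With 100 friends the first function issues 101 statements and the second issues 2. Against a networked database, each statement is typically also a round trip, which is where most of the time goes.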
There are a couple ways I've seen to work around the problem.
Use a lower level database API that lets you write your "insert 100 friends in a single call" type command
Use an ORM that lets you write lower level SQL in order to do the Friends insert as a single SQL command (not all of them allow this and I don't know if Rails does)
Use an ORM that lets you batch writes into a single database call. It's still 101 writes to the database, but it allows the ORM to batch them into a single network call to the database and say "do these 101 things". I'm not sure what ORMs allow for this.
There are probably other ways.
The main point is that using the ORM to ingest any real-sized amount of data can run into efficiency problems. Understanding what the ORM is doing under the hood (asking it to log all DB calls is a good way to see this) is the best first step. Once you know what it's doing, you can look for ways to tell it "what I'm doing doesn't fit well into the normal pattern, let's change how you're being used"... and, should it not have a way that works, you can look at using a lower level API instead.
I'll point out one other thing you can look at with a STRONG caveat that it should be one of the last things you consider. When inserting rows into the database in bulk, you can create a raw text file with all the data (format depends on the db, but the concept is similar to a CSV file) and give the file to the database to import in bulk. It's a bad way to go in almost every case, but I wanted to include it because it does exist as an option.
Edit: As a side note, the comment about more efficiently parsing the XML is a good thing to look at too. Using SAX vs DOM, or a different XML library, can be a huge win in time to completion. In some cases, it can be an even bigger win than more efficient database interaction. For example, you may be parsing a LOT of XML with lots of small pieces of data, and then only use small parts of it. In a case like that, the parsing could take a long time via DOM while SAX could ignore the parts you don't need... or it could be using a lot of memory creating DOM objects and slow down the whole thing due to garbage collection, etc. At the very least, it's worth looking at.
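
As an illustration of the streaming approach, the Python sketch below uses the standard library's `xml.etree.ElementTree.iterparse` to pull one field out of each record and discard the record immediately. The element names (`customer`, `name`) are hypothetical, and the question's Ruby stack would use a different library for the same idea.

```python
import xml.etree.ElementTree as ET


def stream_names(source, record_tag="customer", field="name"):
    """Yield one field per record without keeping the parsed records in memory."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            value = elem.findtext(field)
            # Drop the record's children so memory use stays low while
            # streaming; only empty record elements remain on the root.
            elem.clear()
            yield value
```

The parts of each record that are never read are still tokenized, but they are released as soon as the record ends, instead of accumulating as objects for the garbage collector to deal with.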

Since your question is indeed "a bit vague", I can only recommend optimizing the XML import by using XML Pull parsing.
Take a look at this:
https://gist.github.com/827475
I needed to import MySQL XML, and to be fair, using the XML Pull method sped up the parsing part by a factor of around 7 (yes, almost 7 times faster than reading the entire thing into memory).
Another thing: you are saying "the DB import takes 4 hours". What file formats are these DB exports you are importing?

Flat file in delphi

In my application I want to use files for storing data. I don't want to use a database or a clear text file; the goal is to save double and integer values along with a string that just identifies the name of the record. I simply need to save data on disk for generating reports. A file can grow even to a gigabyte. What format do you suggest? Binary? If so, which VCL component/library do you know that is good to use? My goal is to create an application which creates and updates the files, while another tool will "eat" those files, producing nice PDF reports for the user on demand. What do you think? Any ideas or suggestions?
Thanks in advance.

If you don't want to reinvent the wheel, you may find all needed Open Source tools for your task from our side:
Synopse Big Table to store huge amounts of data - see in particular the TSynBigTableRecord class to store an unlimited number of records with fields, including indexes if needed - it will definitely be faster and use less disk space than any other regular SQL DB
Synopse SQLite3 Framework if you would rather use a standard SQLite engine for the storage - it comes with a full Client/Server ORM
Reporting from code, including pdf file generation
With full Source code, working from Delphi 6 up to XE.
I've just updated the documentation of the framework. More than 600 pages, with details of every class method, and new enhanced general introduction. See the SAD document.
Update: If you plan to use SQLite, you should first work out how the data will be stored, which indexes are to be created, and how a SQL query may speed up your requests. It's a bad idea to read the whole file content for every request: you should structure your data so that a single SQL query can return the expected results. Sometimes, adding extra values (like temporary sums or means) to the data is a good idea. Also consider using the RTree virtual table of SQLite3, which is dedicated to speeding up access to double min/max multi-dimensional data: it may speed up your requests a lot.

You don't want to use a full SQL database, and you think that a plain text file is too simple.
Points in between those include:
Something that isn't a full SQL database, but more of a key-value store, would technically not be a flat file, but it does provide a single "key+value" list, that is quickly searchable on a single primary key. Such as BSDDB. It has the letter D and B in the name. Does that make it a database, in your view? Because it's not a relational database, and doesn't do SQL. It's just a binary key-value (hashtable) blob storage mechanism, using a well-understood binary file format. Personally, I wouldn't start a new project and use anything in this category.
Recommended: Something that uses SQL but isn't as large as a standalone SQL database server. For example, you could use SQLite and a Delphi wrapper. It is well tested, is used in lots of C/C++ and Delphi applications, and can be trusted more than anything you could roll yourself. It is a very light embedded database, and is trusted by many.
Roll your own ISAM, or VLIR, which will eventually morph over time into your own in-house DBMS. There are multiple files involved, and there are indexes, so you can look up data fast without loading everything into memory. Not recommended.
The flattest of flat binary fixed-record-length files. You originally mentioned PowerBASIC in your question, which has something called Random Access files, and then you deleted that from your question. This is probably what you are looking for, especially with append-only writes as the primary operation. Roll your own Turbo Pascal era "file of record". If you use the "FILE OF RECORD" type, you hit the 2 GB limit, and there are problems with Unicode. So use TStream instead, like this. Binary file formats have a lot of strikes against them, especially since it is difficult to grow and expand your binary file format over time without breaking your ability to read old files. This is a key reason why I would recommend you start out with what might at first seem like overkill (SQLite) instead of rolling your own binary solution.
(Update 2: After the question was updated to mention PDFs and what sounds like a reporting-system requirement, I think you really should be using a real database, but perhaps a small and easy-to-use one, like Firebird or InterBase.)
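
For readers who want to see what the fixed-record-length option amounts to, here is a minimal Python sketch using the standard `struct` module on any seekable binary stream. The record layout (a 16-byte name, a double, and a 32-bit integer) is an assumption made for illustration, and this is not the Delphi TStream code the answer links to.

```python
import io
import struct

# Hypothetical layout: 16-byte NUL-padded UTF-8 name, float64, int32,
# little-endian with no alignment padding (28 bytes per record).
RECORD = struct.Struct("<16sdi")


def append_record(f, name, value, count):
    """Append one fixed-length record to the end of a binary stream."""
    f.seek(0, io.SEEK_END)
    # Names longer than 16 bytes are truncated; shorter ones are NUL-padded.
    f.write(RECORD.pack(name.encode("utf-8")[:16], value, count))


def read_record(f, index):
    """Random access: seek straight to record number `index` (0-based)."""
    f.seek(index * RECORD.size)
    chunk = f.read(RECORD.size)
    if len(chunk) != RECORD.size:
        raise IndexError(f"no record at index {index}")
    raw_name, value, count = RECORD.unpack(chunk)
    return raw_name.rstrip(b"\0").decode("utf-8", errors="replace"), value, count
```

The weakness described above is visible here: adding a field changes the record size, so every existing file becomes unreadable unless the format also carries a version marker and migration code.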

I would suggest using TClientDataSet, with its SaveToFile() / SaveToStream() methods in the generating program and its LoadFromFile() / LoadFromStream() methods in the program that will "consume" the data. That way, you can still have indexed records without connecting to any external database, all while keeping the interchange data in a single file.

Define an API for working with your flat file, so that the API can be implemented by a separate data layer in many ways.
Implement the API using a standard embedded SQL database (e.g. SQLite or Firebird).
Only if there is something wrong with the standard solution should you think about writing your own.
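
A minimal Python sketch of that layering follows, with an abstract storage API and one SQLite-backed implementation. The class names, method names, and table layout are hypothetical. A Delphi version would use an interface or abstract class in the same way.

```python
import sqlite3
from abc import ABC, abstractmethod


class MeasurementStore(ABC):
    """Storage API the application codes against; the backend is swappable."""

    @abstractmethod
    def append(self, name, value, qty):
        ...

    @abstractmethod
    def by_name(self, name):
        ...


class SqliteMeasurementStore(MeasurementStore):
    """Standard-solution backend: an embedded SQLite database file."""

    def __init__(self, path=":memory:"):
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS measurement (name TEXT, value REAL, qty INTEGER)"
        )
        self._conn.execute(
            "CREATE INDEX IF NOT EXISTS idx_measurement_name ON measurement (name)"
        )

    def append(self, name, value, qty):
        with self._conn:  # commits on success, rolls back on error
            self._conn.execute(
                "INSERT INTO measurement VALUES (?, ?, ?)", (name, value, qty)
            )

    def by_name(self, name):
        rows = self._conn.execute(
            "SELECT value, qty FROM measurement WHERE name = ? ORDER BY rowid",
            (name,),
        )
        return rows.fetchall()
```

If the embedded database ever turns out to be the bottleneck, a custom file-based class can implement the same two methods without touching the reporting code.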

I use KBMMemtable - see http://www.components4developers.com/ - fast, reliable, been around a long time - supports binary and CSV streaming in and out of files, as well as indexing, filters, and lots of other goodies - TClientDataSet will not do well with large datasets.

How do you implement a multiculture web application

I believe several of us have already worked on a project where not only the UI, but also data has to be supported in different languages. Such as - being able to provide and store a translation for what I'm writing here, for instance.
What's more, I also believe several of us have some time-triggered events (such as when expiring membership access) where user location should be taken into account to calculate, like, midnight according to the right time-zone.
Finally, there's also the need to support Right to Left user interfaces for certain languages, and the use of different encodings when reading submitted data files (parsing text and Excel data, for instance).
Currently I'm storing all the translations for all my entities in a single table (not so practical, as it is very hard to find your way around when running SQL queries to look into a problem), keeping UI translations mainly in satellite assemblies, and supporting neither time zones nor right to left design.
What are your experiences when dealing with these challenges?
[Edit]
I assume most people think that this level of multiculture requirement only comes with building a huge project. As a matter of fact, think about an online survey where:
Answers will be collected only until midnight
The questionnaire definition and part of the answers come from a text file (in any language), as well as the translations
Questions and response options must be displayed in several languages, according to who is accessing it
Reports also have to be shown and generated in several different languages
As one can see, we do not have to go too far in an application to have this kind of requirement.
[Edit2]
Just found out my question is a duplicate
i18n in your projects
The first answer (when ordering by vote) is so comprehensive I have to get at least a part of it implemented someday.

Be very very cautious. From what you say about the i18n features you're trying to implement, I wonder if you're over-reaching.
Notice that the big-boy web applications (e.g. eBay, amazon.com, Yahoo, BBC) actually deliver separate apps in each language they want to support. Each of these web applications does consume a common core set of services. Don't be surprised if the business needs of two different countries that even speak the same language (e.g. UK & US) are different enough that you do need a separate app for each.
On the other hand, you might need to become like the next amazon.com. It's difficult to deliver a successful web application in one language, much less many. You should not be afraid to favor one user population (say, your Asian-language speakers) over others if this makes sense for your web app's business needs.

Go slow.
Think everything through, then really think about what you're doing again. Bear in mind that the more you add (like Right to Left) the longer your QA cycle will be.

The primary piece to your puzzle will be extensive use of interfaces on the code side, and either one data source that gets passed through a translator to whichever languages need to be supported, or separate data sources for each language.
The time issues can be handled by the interfaces, because presumably you will want things to function in the same fashion, but differ in the implementation details. To a large extent, a similar thought process can be applied to the creation of the interface when adjusting it to support differing languages. When you get down to it, skinning is exactly this, where the content being skinned is the interface, and the look/feel is the implementation.

Do what your users need. For instance, most programmers understand English, so there is no point in translating posts on this site. If many of your users need a translation, add a new table column with the language id, and another column to link a translated row to its original. If your target audience includes users from the Middle East, implement Right to Left. If time precision down to the hour is critical, add a time zone column to the user table, and so on.
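
As a sketch of what the time zone column buys you, the Python functions below turn "midnight at the end of a given day, in the user's zone" into a UTC instant that a scheduler can compare against. The function names are assumptions. `user_tz` can be any `tzinfo`, including a `zoneinfo.ZoneInfo` built from an IANA name stored in that column, which also accounts for daylight saving time.

```python
from datetime import date, datetime, time, timedelta, timezone


def end_of_day_utc(local_day, user_tz):
    """Return the UTC instant at which `local_day` ends for a user in `user_tz`."""
    next_midnight_local = datetime.combine(
        local_day + timedelta(days=1), time.min, tzinfo=user_tz
    )
    return next_midnight_local.astimezone(timezone.utc)


def is_expired(now_utc, local_day, user_tz):
    """True once the user's local midnight at the end of `local_day` has passed."""
    return now_utc >= end_of_day_utc(local_day, user_tz)
```

Storing the deadline as a UTC instant and converting only at the edges keeps the comparison itself independent of where the server runs.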

If you're on *NIX, use gettext. Most languages I've used have some level of support; PHP's is pretty good, for instance.

I'll describe what has been done in my project (it wasn't my original architecture but I liked it anyways)
Providing Translation Support
Text which needs to be translated has been divided into three different categories:
Error text: Like errors which happen deep in the application business layer
UI Text: Text which is shown in the User interface (labels, buttons, grid titles, menus)
User-defined Text: text which needs to be translatable according to the final user's preferences (that is - the user creates a question in a survey and he can also create a translated version of that survey)
For each category, the scheme used to provide the translation service is different, so that we have:
Error Text: A library with static functions which access resource files
UI Text: A "Helper" class which, linked to the view engine, provides translations from remote assemblies
User-defined Text: A table in the database which provides translations (according to typeID of the translated entity and object id) and is linked to the entity via a 1 x N relationship
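
The user-defined text table described in the last item might look roughly like the Python/SQLite sketch below. The column names, the fallback-language behaviour, and the function names are assumptions added for illustration. The answer only specifies that rows are keyed by the entity's type id and object id.

```python
import sqlite3


def create_schema(conn):
    # One row per (entity type, entity id, language): the "N" side of the
    # 1 x N relationship between an entity and its translations.
    conn.execute(
        "CREATE TABLE translation ("
        " type_id INTEGER, object_id INTEGER, lang TEXT, body TEXT,"
        " PRIMARY KEY (type_id, object_id, lang))"
    )


def translated_text(conn, type_id, object_id, lang, fallback_lang="en"):
    """Return the text in `lang`, else in `fallback_lang`, else None."""
    row = conn.execute(
        "SELECT body FROM translation"
        " WHERE type_id = ? AND object_id = ? AND lang IN (?, ?)"
        " ORDER BY lang = ? DESC LIMIT 1",
        (type_id, object_id, lang, fallback_lang, lang),
    ).fetchone()
    return row[0] if row else None
```

Filtering on the type id and object id together keeps lookups for one entity cheap even though all entity types share the same table.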
I haven't, however, attacked the other obvious problems such as dealing with time zones, different layouts and picture translation (if this is really necessary). Has anyone tackled this problem in a different way?
Has anyone ever tackled the other i18n problems?
