How to create a dynamic parser? - parsing

I want to create something called a dynamic parser.
The input to my project is a data file (XML, Excel, CSV, and so on). I must parse it, extract its records and their fields, and finally save them to a SQL Server database.
My problem is that the fields of each record are dynamic, so I cannot write the parser at development time; I must provide it at run time. By dynamic I mean that a user selects the fields of each record through a web UI. So at run time I know the number of fields in each record and some information about each field, such as its name.
I discussed this type of project in a question titled 'Design Pattern for Custom Fields in Relational Database'.
I also looked at parser generators, but I did not get enough information about them and I don't know whether they are really related to my problem.
Is there any design pattern for this type of problem?

If you know the number of fields and the field names, extract the data from the file and then build a query using string concatenation.
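A minimal sketch of that idea in Python, under several assumptions: the input is CSV, the selected field names arrive as a plain list (the `selected_fields` value here is hypothetical), and `sqlite3` stands in for SQL Server so the example runs on its own. Only the column names are concatenated into the statement; the values are passed through placeholders, which is a design choice of this sketch rather than something the answer prescribes.

```python
import csv
import io
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so it can be concatenated into SQL."""
    return '"' + name.replace('"', '""') + '"'


def build_insert(table, field_names):
    """Build an INSERT statement for fields that are only known at run time."""
    columns = ", ".join(quote_identifier(f) for f in field_names)
    placeholders = ", ".join("?" for _ in field_names)
    return f"INSERT INTO {quote_identifier(table)} ({columns}) VALUES ({placeholders})"


def load_csv(conn, table, field_names, text):
    """Extract the selected fields from CSV text and insert one row per record."""
    sql = build_insert(table, field_names)
    reader = csv.DictReader(io.StringIO(text))
    rows = [tuple(record[f] for f in field_names) for record in reader]
    conn.executemany(sql, rows)
    return len(rows)


# Hypothetical field selection, as it might arrive from the web UI.
selected_fields = ["name", "city"]
data = "name,age,city\nAda,36,London\nLinus,28,Helsinki\n"

conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server connection
conn.execute('CREATE TABLE "people" ("name" TEXT, "city" TEXT)')
inserted = load_csv(conn, "people", selected_fields, data)
stored = conn.execute('SELECT "name", "city" FROM "people" ORDER BY "name"').fetchall()
```

Supporting XML or Excel would mean swapping the reader for another one that yields the same kind of name-to-value records; the query-building part would not need to change.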

Related

Partial deserialization with Apache Avro

Is it possible to deserialize a subset of fields from a large object serialized using Apache Avro without deserializing all the fields? I'm using GenericDatumReader and the GenericRecord contains all the fields.
I'm pretty sure you can't do it using GenericDatumReader, but my question is whether it is possible given the binary format of Avro.
Conceptually, binary serialization of Avro data is in-order and depth-first. As you traverse the data, record fields are serialized one after the other, lists are serialized from the top to the bottom, etc.
Within one object, there are no markers to separate fields, no tags to identify specific fields, and no index into the binary data to help quickly scan to specific fields.
Depending on your schema, you could write custom code to skip some kinds of data ... for example, if a field is a LIST of FIXED bytes, you could read the size of the list and just jump over the data to the next field. This is pretty specific and wouldn't work for most Avro types though (notably integers are variable length when encoded).
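To illustrate why variable-length integers get in the way, here is a small hand-written sketch (not the Avro library) of how an Avro int/long is laid out: the value is zigzag-encoded and then written in 7-bit groups, with the high bit of each byte signalling that more bytes follow. The function names are made up for this example. Even just skipping such a field means looking at every byte to find where it ends.

```python
def read_long(buf, pos):
    """Decode one Avro int/long starting at pos; return (value, next_pos)."""
    shift = 0
    raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:  # a clear high bit marks the last byte
            break
        shift += 7
    value = (raw >> 1) ^ -(raw & 1)  # undo the zigzag encoding
    return value, pos


def skip_long(buf, pos):
    """Skip one Avro int/long without decoding it; return the next position."""
    while buf[pos] & 0x80:
        pos += 1
    return pos + 1


# Three longs back to back: 1, 64 and -64 (1, 2 and 1 bytes respectively).
buf = bytes([0x02, 0x80, 0x01, 0x7F])
first, pos = read_long(buf, 0)
second, pos = read_long(buf, pos)
third, pos = read_long(buf, pos)
```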
Even in that unlikely case, I don't believe there are any helpers in the Java SDK that would be useful.
In brief, Avro isn't designed to do that, and you're probably not going to find a satisfactory way to do a projection on your Schema without deserializing the entire object. If you have a collection, column-oriented persistence like Parquet is probably the right thing to do!
It is possible if the fields you want to read occur first in the record. We do this in some cases where we want to read only the header fields of an object, not the full data which follows.
You can create a "subset" schema containing just those first fields, and pass this to GenericDatumReader. Avro will deserialise those fields, and anything which comes after will be ignored, because the schema doesn't "know" about it.
But this won't work for the general case where you want to pick out fields from within the middle of a record.
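The reason the prefix trick works can be sketched with a hand-rolled imitation of the binary layout (this is not the Avro library, and the schema representation and field names are invented for the example). A record is simply its fields written one after another, so a reader that only knows the first fields decodes those and never looks at the remaining bytes. With the real Java API you would pass the subset schema to GenericDatumReader as described above.

```python
def write_long(n):
    """Encode an integer as Avro does: zigzag, then 7-bit groups."""
    raw = (n << 1) ^ (n >> 63)
    out = bytearray()
    while raw & ~0x7F:
        out.append((raw & 0x7F) | 0x80)
        raw >>= 7
    out.append(raw)
    return bytes(out)


def write_string(s):
    """Encode a string as a length (long) followed by UTF-8 bytes."""
    data = s.encode("utf-8")
    return write_long(len(data)) + data


def read_long(buf, pos):
    shift = raw = 0
    while True:
        byte = buf[pos]
        pos += 1
        raw |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return (raw >> 1) ^ -(raw & 1), pos
        shift += 7


def read_string(buf, pos):
    length, pos = read_long(buf, pos)
    return buf[pos:pos + length].decode("utf-8"), pos + length


WRITERS = {"long": write_long, "string": write_string}
READERS = {"long": read_long, "string": read_string}


def encode(schema, record):
    """Write the fields in schema order, with no separators or tags."""
    return b"".join(WRITERS[kind](record[name]) for name, kind in schema)


def decode(schema, buf):
    """Read only the fields listed in schema; trailing bytes are ignored."""
    pos = 0
    out = {}
    for name, kind in schema:
        out[name], pos = READERS[kind](buf, pos)
    return out


# Hypothetical schemas: the "header" schema is a prefix of the full one.
full_schema = [("id", "long"), ("kind", "string"), ("payload", "string")]
header_schema = full_schema[:2]

blob = encode(full_schema, {"id": 7, "kind": "event", "payload": "x" * 1000})
header = decode(header_schema, blob)
```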

Classifying words inside a document

The problem that I'm facing is:
I want to read a document, get the raw string of this document, and classify the information.
For example, I want to identify when the string is a "Name", or a "date", or some other useful information.
Is it possible to use machine learning to do that?
How may I approach the problem?
The hardest part here is that I'm not trying to classify the document itself, but the string information inside the document.
So it's all about how you think about your problem. I think your problem can be formulated as an entity extraction/recognition problem, where you have a document and want to identify specific entities within the text (where an entity might be a person, date, etc). Take a look at Conditional Random Fields and their applications to named entity recognition (NER for short), as there are some libraries & tools already implemented.
For example, check out StanfordNER.

Parsing field name using Crystal Reports

I have a customer who really wants to keep a very long naming convention during a migration to a new database. The new database uses Crystal Reports for reporting. I have gotten an ok to shorten the naming convention somewhat to "shortened name-date" with all of the other pertinent information parsed out into new fields.
However, one of the users who does a lot of the reporting has now said that one of the most tedious parts of her job was parsing out the old names so she could have a simple, high-level parent name for executive reports. With the new naming convention, she will still need to parse the field to get just the shortened name as her executive-level parent name. If I can't manage to get the ok to drop the date from this field, can Crystal Reports be used to parse the field at the "-", similar to parsing the data using Excel? What I'm looking for is that her reports would have a formula that generates the executive-level short name behind the scenes so she doesn't have to think about it.
The date already exists in a date field, so parsing out the date from the name would not change other report functionality. Ideally, I would want to enter the data already separated out and concatenate fields per each user's particular needs, but I may not be able to do so. Any info would be much appreciated.
Thank you.
I think you are looking for this...
Split({fieldname},"-")[1]
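Crystal arrays are 1-based, so the [1] above picks the text before the first "-". For anyone who wants to check the behaviour outside Crystal Reports, the same logic can be sketched in Python; the function name and the sample values are made up for illustration.

```python
def parent_name(field_value):
    """Return the text before the first "-", or the whole value if there is none."""
    return field_value.split("-", 1)[0]


short = parent_name("Acme Rollout-2014/03/01")
unchanged = parent_name("NoDateHere")
```

Note that a shortened name which itself contains a "-" would be cut at that first hyphen, in this sketch and in the formula alike.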

form-only lookup

How can I create a form-only lookup in Informix 4GL? I am using a form painter plus Informix SE. Any help would be appreciated. I tried to create the form, but the field is empty while selecting the choice. I think I am missing the relation or something.
FORMONLY is the equivalent of DISPLAYONLY in isql perform screens. Why not just define the database columns in the attributes section and use the NOUPDATE attribute for each column, or use BEFORE EDITUPDATE OF tabname, ABORT?
Since I4GL doesn't come with a form painter, the only ways to know what you can do with it are by reading the manual for your form painter, or by experimenting.
I'm also not entirely sure what you mean by a FORMONLY lookup; it could be any of a number of things. But the basics are that the field in the form is FORMONLY.fieldname TYPE xyz, where xyz is the appropriate type. You use a CONSTRUCT or INPUT to get data into that field; you process the input to do the lookup. INPUT is more appropriate for an exact value lookup; CONSTRUCT will allow more flexible querying.
Since you've not shown what you've tried, nor indicated which form painter you're using, it is going to be hard to help further.
(And I note you've asked this question on the IIUG (International Informix Users Group) mailing list for 'classics' too.)

User-adjustable data structures

Assume a data structure Person used for a contact database. The fields of the structure should be configurable, so that users can add user-defined fields to the structure and even change existing fields. So basically there should be a configuration file like
FieldNo FieldName DataType DefaultValue
0 Name String ""
1 Age Integer "0"
...
The program should then load this file, manage the dynamic data structure (dynamic not in a "change during runtime" way, but in a "user can change via configuration file" way) and allow easy and type-safe access to the data fields.
I have already implemented this, storing information about each data field in a static array and storing only the changed values in the objects.
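A rough sketch of what such an implementation might look like in Python, not the asker's actual code: the class and function names are invented, only the two data types from the sample file are handled, and type safety is approximated with a run-time check. Field definitions with their defaults are loaded once and shared; each object stores only the values that differ from the defaults.

```python
from dataclasses import dataclass

# Assumed mapping from the DataType column to Python types.
CONVERTERS = {"String": str, "Integer": int}


@dataclass(frozen=True)
class FieldDef:
    number: int
    name: str
    data_type: str
    default: object


def load_field_defs(text):
    """Parse lines of the form: FieldNo FieldName DataType DefaultValue."""
    defs = {}
    for line in text.strip().splitlines():
        number, name, data_type, default = line.split(None, 3)
        convert = CONVERTERS[data_type]
        defs[name] = FieldDef(int(number), name, data_type, convert(default.strip('"')))
    return defs


class Record:
    """Holds only the values that were changed away from the defaults."""

    def __init__(self, field_defs):
        self._defs = field_defs
        self._changed = {}

    def get(self, name):
        return self._changed.get(name, self._defs[name].default)

    def set(self, name, value):
        field = self._defs[name]
        if not isinstance(value, CONVERTERS[field.data_type]):
            raise TypeError(f"{name} expects {field.data_type}")
        if value == field.default:
            self._changed.pop(name, None)
        else:
            self._changed[name] = value


CONFIG = '''
0 Name String ""
1 Age Integer "0"
'''

person = Record(load_field_defs(CONFIG))
person.set("Name", "Ada")
age = person.get("Age")  # falls back to the configured default
```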
My question: Is there any pattern describing that situation? I guess that I'm not the first one running into the problem of creating a user-adjustable class?
Thanks in advance. Tell me if the question is not clear enough.
I've had a quick look through "Patterns of Enterprise Application Architecture" by Martin Fowler, and the Metadata Mapping pattern describes (at a quick glance) what you are describing.
An excerpt...
"A Metadata Mapping allows developers to define the mappings in a simple tabular form, which can then be processed by generic code to carry out the details of reading, inserting and updating the data."
HTH
I suggest looking at the various object-relational patterns in Martin Fowler's Patterns of Enterprise Application Architecture.
The best fit to your problem appears to be Metadata Mapping. There are other patterns too, such as Mapper.
The normal way to handle this is for the class to hold a list of user-defined records, each of which consists of a list of user-defined fields. The configuration information for this can easily be stored in a database table containing a type id, field type, etc. The actual data is then stored in a simple table with the data represented only as (object id + field index)/string pairs; you convert the strings to and from the real type when you read or write the database.
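A minimal sketch of that storage layout, assuming SQLite through Python's `sqlite3` module and invented table and column names (`field_def`, `field_value`). Every value is written as a string and converted back to its configured type on read.

```python
import sqlite3

# Assumed mapping from the stored field type to a Python converter.
CONVERTERS = {"String": str, "Integer": int}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE field_def (
        field_index INTEGER PRIMARY KEY, name TEXT, data_type TEXT
    );
    CREATE TABLE field_value (
        object_id INTEGER, field_index INTEGER, value TEXT,
        PRIMARY KEY (object_id, field_index)
    );
    INSERT INTO field_def VALUES (0, 'Name', 'String'), (1, 'Age', 'Integer');
""")


def save(conn, object_id, values):
    """Store each field as an (object id, field index) -> string pair."""
    for name, value in values.items():
        (index,) = conn.execute(
            "SELECT field_index FROM field_def WHERE name = ?", (name,)
        ).fetchone()
        conn.execute(
            "INSERT OR REPLACE INTO field_value VALUES (?, ?, ?)",
            (object_id, index, str(value)),
        )


def load(conn, object_id):
    """Read the strings back and convert them to their configured types."""
    rows = conn.execute(
        "SELECT d.name, d.data_type, v.value "
        "FROM field_value v JOIN field_def d ON d.field_index = v.field_index "
        "WHERE v.object_id = ?",
        (object_id,),
    )
    return {name: CONVERTERS[data_type](value) for name, data_type, value in rows}


save(conn, 1, {"Name": "Ada", "Age": 36})
person = load(conn, 1)
```

Adding a user-defined field then only needs a new row in the definition table, with no change to the schema of the value table.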

Resources