How to identify text file format by its structure? - parsing

I have a few text file types with data such as product info, stock, supplier info, etc., and they are all structured differently. There is no other identifier for the type except the structure itself (there are no headers, no filename conventions, etc.).
Some examples of these files:
(products and stocks)
2326 | 542212 | Bananas | 00023 | 1 | pack
2326 | 297875 | Apples | 00085 | 1 | bag
2326 | 028371 | Pineapple | 00007 | 1 | can
...
(products and prices)
12556 Meat, pork 0098.57
58521 Potatoes, mashed 0005.20
43663 Chicken wings 0009.99
...
(products and suppliers - here N is the separator)
03038N92388N9883929
28338N82367N2837912
23002N23829N9339211
...
(product information - multiple types of rows)
VIN|Mom & Pops|78 Haley str.
PIN|BLT Bagel|5.79|FRESH
LID|0239382|283746
... (repeats this type of info for different products)
And several others.
I want to make a function that identifies which of these types a given file is, using nothing but the content. Google has been no help, in part because I don't know what search term to use. Needless to say, "identify file type by content/structure" is of no help; it just gives me results on how to detect jpgs, pdfs, etc. It would be helpful to see some code that others wrote to deal with a similar problem.
What I have thought of so far is to make a FileIdentifier class for each type, then, when given a file, try to parse it, and if that doesn't work move on to the next type. But that seems error-prone to me, and I would have to hardcode a lot of information. Also, what happens if another format comes along that is very similar to one of the existing ones, but has different information in the columns?

There really is no one-size-fits-all answer unless you can limit the set of file formats that can occur. You will only ever be able to find a heuristic for identifying formats, unless you can get whoever designs these formats to include a unique identifier, or you ask the user what format the file is.
That said, there are things you can do to improve your results, like making sure you try all candidate formats and then pick the best fit instead of the first match.
The general approach will always be the same: make each decode attempt as strict as possible, and with as much knowledge about not just syntax, but also semantics. I.e., if you know an item can only contain one of 5 values, or numbers in a certain range, use that knowledge for detection. Also, don't just call strtol() on a component and accept the result; check that it parsed the entire string. If it didn't, either fail right there, or maintain a "confidence" value and lower it whenever a file has any possibly invalid parts.
Then in the end, go through all parse results and pick the one with the highest confidence. Or, if you can't decide, ask the user to pick between the most likely formats.
PS - The file command line tool on Unix systems does something similar: it looks at the start of a file for common sequences that indicate certain file formats.
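To make that concrete, here is a minimal sketch of the best-fit idea in Java. Everything in it is invented for illustration (the FormatIdentifier interface, the single identifier for the pipe-delimited stock format, the 0.5 cut-off); you would write one identifier per format and make its checks as strict as the format allows:

import java.util.List;

// Invented sketch: one identifier per known format, each returning a
// confidence between 0 and 1 instead of a hard yes/no.
interface FormatIdentifier {
    String name();
    double confidence(List<String> lines);
}

// Example identifier for the pipe-delimited "products and stocks" format.
class PipeStockIdentifier implements FormatIdentifier {
    public String name() { return "products-and-stocks"; }

    public double confidence(List<String> lines) {
        if (lines.isEmpty()) return 0.0;
        int clean = 0;
        for (String line : lines) {
            String[] parts = line.split("\\|");
            // Strict checks: six columns, digits wherever digits are expected.
            if (parts.length == 6
                    && parts[0].trim().matches("\\d+")
                    && parts[1].trim().matches("\\d+")
                    && parts[3].trim().matches("\\d+")
                    && parts[4].trim().matches("\\d+")) {
                clean++;
            }
        }
        return (double) clean / lines.size();   // fraction of lines that parsed cleanly
    }
}

class FormatDetector {
    // Try every known format and keep the best fit, not the first match.
    static String detect(List<String> lines, List<FormatIdentifier> identifiers) {
        String best = "unknown";
        double bestScore = 0.5;                 // below this, refuse to guess
        for (FormatIdentifier id : identifiers) {
            double score = id.confidence(lines);
            if (score > bestScore) {
                bestScore = score;
                best = id.name();
            }
        }
        return best;
    }
}

The point of scoring the whole file rather than stopping at the first successful parse is that two similar formats can still be told apart by which one parses more of the file cleanly, and anything below the cut-off is where you would fall back to asking the user.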

Related

AIML Parser PHP

I am trying to develop an artificial bot. I found that AIML is something that can be used for achieving such a goal, and I found these points regarding AIML parsing, which is done by Program-O:
1.) All letters in the input are converted to UPPERCASE
2.) All punctuation is stripped out and replaced with spaces
3.) Extra whitespace characters, including tabs, are removed
From there, Program O performs a search in the database, looking for all potential matches to the input, including wildcards. The returned results are then “scored” for relevancy and the “best match” is selected. Program O then processes the AIML from the selected result, and returns the finished product to the user.
I am just wondering how to define the score and find the relevant answer closest to the user's input.
Any help or ideas will be appreciated.
@user3589042 (rather cumbersome name, don't you think?)
I'm Dave Morton, lead developer for Program O. I'm sorry I missed this at the time you asked the question. It only came to my attention today.
The way that Program O scores the potential matches pulled from the database is this:
Is the response from the aiml_userdefined table? yes=300/no=0
Is the category for this bot, or its parent (if it has one)? this=250/parent=0
Does the pattern have one or more underscore (_) wildcards? yes=100/no=0
Does the current category have a <topic> tag? yes(see below)/no=0
a. Does the <topic> contain one or more underscore (_) wildcards? yes=80/no=0
b. Does the <topic> directly match the current topic? yes=50/no=0
c. Does the <topic> contain a star (*) wildcard? yes=10/no=0
Does the current category contain a <that> tag? yes(see below)/no=0
a. Does the <that> contain one or more underscore (_) wildcards? yes=45/no=0
b. Does the <that> directly match the current topic? yes=15/no=0
c. Does the <that> contain a star (*) wildcard? yes=2/no=0
Is the <pattern> a direct match to the user's input? yes=10/no=0
Does the <pattern> contain one or more star (*) wildcards? yes=1/no=0
Does the <pattern> match the default AIML pattern from the config? yes=5/no=0
The script then adds up the scores of all passed tests listed above, and also adds a point for each word in the category's <pattern> that also matches a word in the user's input. The AIML category with the highest score is considered to be the "best match". In the event of a tie, the script will then select either the "first" highest-scoring category, the "last" one, or one at random, depending on the configuration settings. This selected category is then returned to other functions for parsing of the XML.
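Purely as an illustration of how those weights combine (Program O itself is written in PHP and works against its database; the class and field names below are invented), the scoring boils down to something like this:

import java.util.Set;

// Invented sketch of the weighting above: each candidate category pulled from
// the database gets the sum of the tests it passes, plus one point per shared word.
class Candidate {
    boolean fromUserDefinedTable, categoryForThisBot;
    boolean patternUnderscore, patternStar, patternDirectMatch, patternIsDefault;
    boolean topicUnderscore, topicDirectMatch, topicStar;
    boolean thatUnderscore, thatDirectMatch, thatStar;
    String[] patternWords;
}

class Scorer {
    static int score(Candidate c, Set<String> inputWords) {
        int s = 0;
        if (c.fromUserDefinedTable) s += 300;
        if (c.categoryForThisBot)   s += 250;
        if (c.patternUnderscore)    s += 100;
        if (c.topicUnderscore)      s += 80;
        if (c.topicDirectMatch)     s += 50;
        if (c.topicStar)            s += 10;
        if (c.thatUnderscore)       s += 45;
        if (c.thatDirectMatch)      s += 15;
        if (c.thatStar)             s += 2;
        if (c.patternDirectMatch)   s += 10;
        if (c.patternStar)          s += 1;
        if (c.patternIsDefault)     s += 5;
        for (String word : c.patternWords)          // one extra point per pattern word
            if (inputWords.contains(word)) s += 1;  // that also appears in the user's input
        return s;
    }
}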
I hope this answers your question.

building a lexer with very many tokens

I've been searching for two hours now, and I don't really know what to do.
I'm trying to build an analyser which uses a lexer that can match several thousand words. They are natural-language words, which is why there are so many of them.
I first tried a simple approach with just 1000 different matches for one token:
TOKEN :
{
<VIRG: ",">
| <COORD: "et">
| <ADVERBE: "vraiment">
| <DET: "la">
| <ADJECTIF: "bonne">
| <NOM: "pomme"
| "émails"
| "émaux"
| "APL"
| "APLs"
| "Acide"
| "Acides"
| "Inuk"
[...]
After compiling with javac, it reports that the code is too large.
So, how could I manage thousands of tokens in my lexer?
I've read that it is more efficient to use n tokens for each word than one token for n words. But in this case I will have rules with 1000+ tokens, which doesn't look like a better idea;
I could modify the token manager, or build one, so it just matches words in a list;
Here I know that the lexer is a finite state machine, and this is why it's not possible, so is there any way to use another lexer?;
I could automatically generate a huge regular expression which matches every word, but that wouldn't let me handle the words independently afterwards, and I'm not sure that writing a 60-line regular expression would be a great idea;
Maybe there is a way to load the tokens from a file; this solution is pretty close to solutions 2 and 3;
Maybe I should use another language? I'm trying to migrate from XLE (which can handle a lexicon of more than 70,000 tokens) to Java, and what is interesting here is to generate Java files!
So here it is: I can't find a way to handle several thousand tokens with a javacc lexer. It would be great if anyone is used to that and has an idea.
Best
Corentin
I don't know how javacc builds its DFA, but it's certain that a DFA capable of distinguishing thousands of words would be quite large. (But by no means unreasonably large: I've gotten flex to build DFAs with hundreds of thousands of states without major problems.)
The usual approach to lexicons with a huge number of fixed lexemes is to use the DFA to recognize a potential word (e.g., a sequence of alphabetic characters) and then look the word up in a dictionary to get the token type. That's also more flexible, because you can update the dictionary without recompiling.
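As a rough sketch of that dictionary approach, in plain Java rather than javacc (the lexicon file format of word<TAB>type per line and the type names are invented):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: one generic word rule in the lexer, then a dictionary lookup
// to decide the token type.
class DictionaryLexer {
    enum Type { VIRG, COORD, ADVERBE, DET, ADJECTIF, NOM, UNKNOWN }

    private final Map<String, Type> lexicon = new HashMap<>();

    DictionaryLexer(Path lexiconFile) throws IOException {
        for (String line : Files.readAllLines(lexiconFile)) {
            String[] parts = line.split("\t");
            lexicon.put(parts[0], Type.valueOf(parts[1]));   // e.g. "pomme<TAB>NOM"
        }
    }

    // The "DFA" part: recognise commas and runs of letters...
    private static final Pattern WORD = Pattern.compile(",|\\p{L}+");

    List<Map.Entry<String, Type>> tokenize(String input) {
        List<Map.Entry<String, Type>> tokens = new ArrayList<>();
        Matcher m = WORD.matcher(input);
        while (m.find()) {
            String w = m.group();
            // ...and the dictionary part: look the recognised word up to get its type.
            Type type = w.equals(",") ? Type.VIRG : lexicon.getOrDefault(w, Type.UNKNOWN);
            tokens.add(Map.entry(w, type));
        }
        return tokens;
    }
}

The lexicon lives in a file rather than in the generated token manager, so a 70,000-word vocabulary never runs into the "code too large" limit, and the words can still be handled individually afterwards because each one comes back tagged with its type.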

Reusing SpecFlow backgrounds, is this right?

I have been working with SpecFlow for a little over a month and got to the point where I configured a Background scenario to set up/verify common data in a database, so the next step was trying to reuse the background across several feature files, to avoid cutting and pasting.
This has been asked before, but I expected something else, more user-friendly, just as the Background scenario is easy to understand and update:
Background:
Given I have created the following currencies:
| Code | Name |
| USD | United States Dollar |
| EUR | Euro |
And I have created the following countries:
| Code | Currency | Name |
| US | USD | United States |
| ES | EUR | Spain |
| IT | EUR | Italy |
I found a quite naive solution that is working (or at least seems to, so far), but I'm concerned it may lead me the wrong way, because of my shallow knowledge of SpecFlow.
Taking a look at the generated code for a feature file I got to this:
Create a "feature" file that only has the background scenario, named something like "CommonDataSetup"
Create a step definition like:
[Given(@"common data configuration has been verified")]
public void GivenCommonDataConfigurationHasBeenVerified()
{
    // this class is inside the generated feature file
    var commonSetup = new CommonDataSetupFeature();
    var scenarioInfo = new ScenarioInfo("Common data configuration", ((string[])(null)));

    // drive the generated fixture by hand: set up, run the background, tear down
    commonSetup.FeatureSetup();
    commonSetup.ScenarioSetup(scenarioInfo);
    commonSetup.FeatureBackground();
    commonSetup.ScenarioCleanup();
    commonSetup.FeatureTearDown();
}
In the Background of the other feature files write:
Background:
Given common data configuration has been verified
So now I can reuse the "common data configuration" step definition in as many feature files as I need, keeping things DRY, and the background scenarios can be much shorter.
It seems to work fine, but I wonder: is this the right way to achieve background reuse?
Thanks in advance.
If you have a conversation with a business person who wants the feature, they probably don't say "Given common data configuration has been verified..."
They probably say something like, "Okay, you've got your standard currencies and country codes..."
Within that domain, as long as the idea of standard countries and currencies is really well-known and understood, you don't need to include it. It has to be the case that every single person on the team is familiar with these, though. The whole business needs to be familiar with them. If they're that completely, totally familiar, then re-introducing a table full of them at the beginning of every scenario would be waste.
Anything you can do to eliminate that waste and get to the interesting bits of the scenario is good. Remember that the purpose of the conversations is to surface uncertainty and misunderstandings, and nobody's likely to get these wrong. The automation is a record of those conversations, and you really don't even need to have much of a conversation for this step.
Do have the conversations, though. Even if it's just one line and everyone knows what it is, using the business language for it is important. Without that, you'll end up discussing these really boring bits to try and work out what you each mean by "common data configuration" and "verify" before you can move on to the interesting parts of the scenarios.
Short version: I'd expect to see something like:
Given standard currencies and country codes
When...
You don't even need to use background for that, and however you implement it is fine. If you have a similar situation with standard data that's slightly less familiar, then include it in each feature file; it's important not to hide magic. Remember that readability trumps DRY in tests (which are really records of conversations).
I understand where the need comes from, but reusing the same background in different feature files is against the idea behind Gherkin.
See https://github.com/cucumber/cucumber/wiki/Gherkin
Gherkin is the language that Cucumber understands. It is a Business Readable, Domain Specific Language that lets you describe software’s behaviour without detailing how that behaviour is implemented.
With the "Given common data configuration has been verified" step it is not more business readable.
Additionally, your current implementation messes with the internal state of SpecFlow. It is somehow working now, but at some point you will get into trouble with it.
If you need something set up in every test, have you had a look at the various hooks?
http://www.specflow.org/documentation/Hooks/
With a [BeforeScenario] hook you could set up your tests.

Should BDD scenarios include actual test data, or just describe it?

We've come to a point where we've realised that there are two options for specifying test data when defining a typical CRUD scenario:
Option 1: Describe the data to use, and let the implementation define the data
Scenario: Create a region
Given I have navigated to the "Create Region" page
And I have typed in a valid name
And I have typed in a valid code
When I click the "Save" button
Then I should be on the "Regions" page
And the page should show the created region details
Option 2: Explicitly state the test data to use
Scenario: Create a region
Given I have navigated to the "Create Region" page
And I have filled out the form as follows
| Label | Value |
| Name | Europe |
| Code | EUR |
When I click the "Save" button
Then I should be on the "Regions" page
And the page should show the following fields
| Name | Code |
| Europe | EUR |
In terms of benefits and drawbacks, what we've established is that:
Option 1 nicely covers the case when the definition of, say, a "valid name" changes. This could be more difficult to deal with if we went with Option 2, where the test data is in several places. Option 1 explicitly describes what's important about the data for this test, especially if it were a scenario where we were saying something like "has typed in an invalid credit card number". It also "feels" more abstract and BDD somehow, being more concerned with description than implementation.
However, Option 1 uses very specific steps which would be hard to re-use. For example "the page should show the created region details" will probably only ever be used by this scenario. Conversely we could implement Option 2's "the page should show the following fields" in a way that it could be re-used many times by other scenarios.
I also think Option 2 seems more client-friendly, as they can see by example what's happening rather than having to interpret more abstract terms such as "valid". Would Option 2 be more brittle though? Refactoring the model might mean breaking these tests, whereas if the test data is defined in code the compiler will help us with model changes.
I appreciate that there won't be a right or wrong answer here, but would like to hear people's opinions on how they would decide which to use.
Thanks!
I would say it depends. There are times when a Scenario might require a large amount of data to complete a successful run. Often the majority of that data is not important to the thing we are actually testing and therefore becomes noise distracting from the understanding we are trying to achieve with the Scenario. I started using something I call a Default Data pattern to provide default data that can be merged with data specific to the Scenario. I have written about it here:
http://www.cheezyworld.com/2010/11/21/ui-tests-default-dat/
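The gist of the pattern, sketched here in Java with invented field names: keep one complete set of defaults per kind of record, and let each scenario override only the fields it actually talks about.

import java.util.HashMap;
import java.util.Map;

// Invented sketch of the Default Data idea: the defaults satisfy every field a
// valid record needs; the scenario supplies only the values it cares about.
class DefaultData {
    static Map<String, String> region(Map<String, String> overrides) {
        Map<String, String> data = new HashMap<>();
        data.put("Name", "Default Region");   // noise as far as most scenarios are concerned
        data.put("Code", "DEF");
        data.put("TimeZone", "UTC");
        data.putAll(overrides);               // scenario-specific values win
        return data;
    }
}

A scenario that only cares about the code would call DefaultData.region(Map.of("Code", "EUR")) and never have to mention the rest, which keeps the noise out of the feature file.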
I hope this helps.
I prefer option 2.
To the business user it is immediately clear what the inputs and outputs are. With option 1 we don't know what valid data is, so your implementation may be wrong.
You can be even more expressive by adding invalid data too, when appropriate:
Scenario: Filter for Awesome
Given I have navigated to the "Show People" page
And I have the following data
| Name | Value |
| John | Awesome |
| Bob | OK |
| Jane | Fail |
When I click the "Filter" button
Then the list should display
| Name | Value |
| John | Awesome |
You should, however, keep the data described in terms of the domain rather than the specific implementation. This will allow you to test at different layers in your application, e.g. UI, service, etc.
Every time I think about this I change my mind. But if you think about it - the test is to prove that you can create a region. A criterion met by both options. But I agree that the visual cues of option 2 and its developer-friendliness are probably too good to turn down. In examples like this, at least.
I would suggest you take a step back and ask what stories and rules you are trying to illustrate with these scenarios. If there are rules about what makes a valid or invalid region code, and your stakeholders want to describe those using BDD, then you can use specific examples of valid and invalid region codes. If you want to describe what can happen after a region is created, then the exact data is not so interesting.
Your "Create a region" is not actually typical of the scenarios that we use in BDD. It can be characterised as "when I create a thing, then I can see the thing". It's not a useful scenario in that it doesn't by itself deliver anything valuable to the user. We look for scenarios in which something interesting or valuable is delivered to the end-user. Why is the user creating a region? What is the end goal? So that another user can assign other objects to that region, perhaps?
Example mapping, where stories are linked with rules and examples (where the examples become scenarios), is described in https://cucumber.io/blog/bdd/example-mapping-introduction/

Parsing semi-structured data - can I use any classifiers?

I've got a set of documents which have a semi-regular format. Rows are typically separated by new line characters, and the main components of each row are separated by spaces. Some examples are a set of furniture assembly instructions, a set of table of contents, a set of recipes and a set of bank statements.
The problem is that each specimen in each set is different from its peer members in ways which make RegEx parsing infeasible: the quantity of an item may come before or after the item name, the same items may have different names between specimens, expository text or notes may exist between rows, etc.
I've used classifiers (Neural Nets, Bayesian, GA and GP) to deal with whole documents or data sets, but not to extract items from documents and classify them within a context. Can this be done? Is there a more feasible approach?
If your data has structure, arguably you can use a grammar to describe some of that structure. (Classically you use grammars to recognize what they can, often too much, and extra-grammatical checks to prune away what the grammars cannot eliminate).
If you use a grammar that can run parallel potential parses, eliminating parses as they become infeasible, you can handle different orderings straightforwardly. (A GLR parser can do this nicely.)
Imagine you have NUMBERS describing amounts, NOUNS describing various objects, and VERBS for actions.
Then a grammar that can accept varying orders of items might be:
G = SENTENCE '.' ;
SENTENCE = VERB NOUN NUMBER ;
SENTENCE = VERB NUMBER NOUN ;
SENTENCE = NOUN VERB NUMBER ;
VERB = 'ORDER' | 'ORDERED' | 'SAW' ;
NUMBER = '1' | '2' | '10' ;
NOUN = 'JOE' | 'TABLE' | 'SAW' ;
This sample is extremely simple, but it will handle:
JOE ORDERED 10.
JOE SAW 1.
ORDER 2 SAW.
It will also accept:
SAW SAW 10.
You can eliminate this by adding an external constraint that actors must be people.
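A tiny sketch of such an external constraint, applied after the parse (the Sentence class and the set of people are invented; a real GLR parser would hand you a set of surviving parses to filter):

import java.util.Set;

// Invented post-parse check: reject parses whose subject noun is not a person,
// so "SAW SAW 10." is dropped even though the grammar accepts it.
class Sentence {
    final String subject, verb, number;
    Sentence(String subject, String verb, String number) {
        this.subject = subject; this.verb = verb; this.number = number;
    }
}

class SemanticFilter {
    private static final Set<String> PEOPLE = Set.of("JOE");

    static boolean plausible(Sentence s) {
        return PEOPLE.contains(s.subject);
    }
}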
There are plenty of methods to do that. It is an active research area called information extraction, in particular information extraction from semi-structured sources.
