How would you parse an email in Ruby on Rails without using a library like TMail that does the parsing for you?
Since this is for an assignment, I will answer with some resources that outline the format an email should take.
There have been several email specifications:
https://www.rfc-editor.org/rfc/rfc822 - the original specification (1982)
https://www.rfc-editor.org/rfc/rfc2822 - updated it (2001)
https://www.rfc-editor.org/rfc/rfc5322 - the latest update (2008)
Depending on how much you actually need to implement, I would suggest starting with the first, as it is the simplest. These documents outline the patterns and format the message will be in.
To start with, you will find the headers, then an empty line (two consecutive CRLFs), followed by the message body. There are a couple of options here: it's quite possible to start with regular expressions or even plain pattern matching, but it may be worth looking into parsing expression grammars in more detail (Treetop, for example).
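As a starting point, here is a minimal sketch of that header/body split in plain Ruby. It assumes a well-formed message with CRLF line endings and ignores everything the RFCs add on top (MIME parts, comments, quoting); the sample message is made up:

```ruby
# Minimal sketch: split a raw message into headers and body.
raw = "From: alice@example.com\r\nSubject: Hi\r\n\r\nHello there.\r\n"

# The first empty line separates the header block from the body.
header_part, body = raw.split("\r\n\r\n", 2)

headers = {}
# Split on CRLF *not* followed by whitespace, so folded (continued)
# header lines stay attached to their field.
header_part.split(/\r\n(?![ \t])/).each do |line|
  name, value = line.split(/:\s*/, 2)
  headers[name] = value.gsub(/\r\n[ \t]+/, " ") # unfold continuations
end
```

After this, `headers` maps field names to values and `body` holds the message text, which is usually enough scaffolding to start layering the RFC rules on.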
I hope this gets you started.
Related
I am trying to use ANTLR as a parser for my company's latest project. I have been unable to find any information on how to parse one number, say (0005039906179210835699175654), into multiple tokens (a 5-digit number, a 3-digit number, a 14-digit number, and a 6-digit number).
My current code spits back an error,
line 1:1 no viable alternative at input '0005039906179210835699175654'
Also, on another note, does anyone know how to get the name of a token by using a listener? That's just a bonus question I guess :) Thanks in advance to everyone who responds!
EDIT:
To clarify the whole problem, my company receives information in a legacy format from automated systems. This information must be parsed into POJOs for further processing. I am trying to use ANTLR as an easy, smooth, readable, and expandable solution to this. One example is this line:
U0005138606179090232769522950 0863832 18322862 0284785 3
Which must be parsed into the sections: U, 00051, 386, 06179090232769, 522950, 0863832, 18322862, 0284785, and 3. Obviously the sections separated by white space are easy to parse but I have been unable to find a way in ANTLR to parse the values not separated by white space. Any help would be appreciated, thanks!
EDIT2:
To be perfectly clear as to why I'm using ANTLR instead of just java, my company receives messages in 5 legacy formats, and the system implemented to parse them must be easily expandable to accommodate more in the future. ANTLR is easy to read and understand. Plus, it will be easier to construct additional grammars and listeners than try to maintain a random mess of java.
EDIT3:
I thought of a solution but it is pretty janky. My idea is to parse the 28 character number as one token, then split it using java from a listener since it is broken up the same way each time. I'll report back later today on whether I got it to work.
EDIT4, FINAL UPDATE:
I have chosen to go with the solution mentioned in EDIT3. It is not pretty, but it works and it is fast enough. Thank you very much to everyone who commented, shared ideas, and stimulated thought!
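For illustration, the token-then-split idea from EDIT3 can be sketched like this (shown in Ruby for brevity; the asker's Java listener would do the same thing with fixed-offset `substring` calls). The widths come from the example line in the question:

```ruby
# Lex the 28-character run as a single token, then slice it by
# fixed widths: 1, 5, 3, 14, and 6 characters.
token = "U0005138606179090232769522950"

# String#unpack with "A<n>" directives extracts fixed-width fields.
fields = token.unpack("A1A5A3A14A6")
# fields => ["U", "00051", "386", "06179090232769", "522950"]
```

This keeps the grammar trivial (one token rule for the whole run) and pushes the positional knowledge into one line of splitting code, which is easy to change if the legacy layout ever shifts.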
I'm working with an application that receives all the text from an invoice (the text is obtained by processing a scanned image of the invoice). Because several invoice formats are in use, I need to categorize which format the application is receiving. For example, some formats contain a unit count and some don't (but both have a total cost).
I did some research on parsing techniques but found no workable solution for this. Do you have any suggestions for this type of problem?
In Perl, you can use Marpa, a general BNF parser: describe your invoice format in BNF and Marpa will parse your invoices according to it; see, e.g., how it tackled this complex example with this simple code.
Using a parser generator I want to create a parser for "From headers" in email messages. Here is an example of a From header:
From: "John Doe" <john@doe.org>
I think it will be straightforward to implement a parser for that.
However, there is a complication in the "From header" syntax: comments may be inserted just about anywhere. For example, a comment may be inserted within "john":
From: "John Doe" <jo(this is a comment)hn@doe.org>
And comments may be inserted in many other places.
How to handle this complication? Does it require a "2-pass" parser: one pass to remove all comments and a second pass to create the parse tree for the From header? Do modern parser generators support multiple passes on the input? Can it be parsed in a single pass? If yes, would you sketch the approach please?
I'm not convinced that your interpretation of email addresses is correct; my reading of RFC-822 leads me to believe that a comment can only come before or after a "word", and that "word"s in the local-part of an addr-spec need to be separated by dots ("."). Section 3.1.4 gives a pretty good hint on how to parse: you need a lexical analyzer which feeds syntactic symbols into the parser; the lexical analyzer is expected to unfold headers, ignore whitespace, and identify comments, quoted strings, atoms, and special characters.
Of course, RFC-822 has long been obsoleted, and I think that email headers with embedded comments are anachronistic.
Nonetheless, it seems like you could easily achieve the analysis you wish using flex and bison. As indicated, flex would identify the comments. Strictly speaking, you cannot identify comments with a regular expression, since comments nest. But you can recognize simple nested structures using a start condition stack, or even more economically by maintaining a counter (since flex won't return until the outermost parenthesis is found, the counter doesn't need to be global.)
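A minimal sketch of that counter idea in Ruby: strip (possibly nested) parenthesized comments from a header value by tracking parenthesis depth in a single pass. It deliberately ignores quoted strings and backslash escapes, which a real RFC 822 lexer must also handle:

```ruby
# Remove (possibly nested) "(...)" comments by counting depth.
def strip_comments(str)
  depth = 0
  out = +""
  str.each_char do |ch|
    case ch
    when "(" then depth += 1
    when ")" then depth -= 1 if depth > 0
    else out << ch if depth.zero? # keep only top-level characters
    end
  end
  out
end

strip_comments("jo(this is a comment)hn@doe.org")
# => "john@doe.org"
```

Because the counter makes nesting explicit, this is exactly the single extra bit of state that lets a one-pass lexer treat comments as ignorable whitespace before the parser ever sees them.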
Facebook just re-launched Comments, with an automatic grammar-fixing feature.
What does the grammar filter do?
- Adds punctuation (e.g. periods at the end of sentences)
- Trims extra whitespace
- Auto-cases words (e.g. capitalizes the first word of a sentence)
- Expands slang words (e.g. plz becomes please)
- Adds a space after punctuation (e.g. Hi,Cat would become Hi, Cat)
- Fixes common grammar mistakes (e.g. converts 'dont' to 'don't')
What is an equivalent plugin or gem?
I don't know of anything with those particular features.
However, you might look at Ruby LinkParser, which is a Ruby wrapper for the Link Grammar parser developed by academics and used by the AbiWord project for grammar checking. (Note that "link" in Link Grammar parser doesn't refer to HTML links, but rather to a structure that describes English syntax as a set of links between words.)
Here's another interesting checker, written in Ruby, which is designed to check LaTeX files for some of the problems you mention (plus others).
"After the Deadline" is a complete (free) grammar checking service. Someone has already written a Ruby wrapper for it.
https://github.com/msepcot/after_the_deadline
You may be interested in Gingerice, which seems to do what you are looking for!
I have a requirement to handle custom date formats in an existing app. The users have to deal with multiple formats from outside sources over which they have very little control. We need to be able to take a format and both validate dates against it and parse strings specifically in that format. The other thing is that these formats can be completely arbitrary, like JA == January, FE == February, etc.
To my understanding, Chronic only handles parsing (and does it in a more magical way than I can use), and DateTime#strptime comes close but doesn't really handle the whole two-character-month scenario, even with custom formatters. The 'nuclear' option is to write in custom support for edge cases like this, but I would prefer to use a library if something like this exists.
I don't think something that handles all these problems exists if the format is really very arbitrary. It would probably be easiest to "mold" your input into a form that can be handled by Date.parse, Date.strptime, or another existing tool, even though that could mean quite a bit of work.
How many different formats are we talking about? Do any of them conflict? It seems like you could just gsub like so: input_string.gsub(/\bJA\b/i, 'January'). Is this part of an import routine, or are the users going to be typing in dates in different formats?
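The gsub-then-parse "molding" idea above can be sketched like this. The `MONTHS` table and the "Month DD, YYYY" layout are assumptions for illustration; the real mapping and layout would come from the outside sources:

```ruby
require "date"

# Hypothetical two-letter month table: extend with the real mappings.
MONTHS = { "JA" => "January", "FE" => "February" }

# Mold the custom tokens into something Date.strptime understands,
# assuming the rest of the string follows a "Month DD, YYYY" layout.
def parse_custom(str)
  molded = str.gsub(/\b(JA|FE)\b/i) { |m| MONTHS[m.upcase] }
  Date.strptime(molded, "%B %d, %Y")
end

parse_custom("JA 15, 2012") # => Date for 2012-01-15
```

Because the substitution and the strptime format are both data, each outside format could be described by its own (table, format-string) pair rather than hand-written parsing code.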
There's a related question here: Parse Italian Date with Ruby