Parsing and validating arbitrary date formats in Ruby (on Rails)

I have a requirement to handle custom date formats in an existing app. The users have to deal with multiple formats from outside sources that they have very little control over. We need to be able to take a format and both validate dates against it and parse strings specifically in that format. The other thing is that these formats can be completely arbitrary, like JA == January, FE == February, etc.
To my understanding, Chronic only handles parsing (and does it in a more magical way than I can use), and DateTime.strptime comes close, but doesn't really handle the whole two-character month scenario, even with custom formatters. The 'nuclear' option is to write custom support for edge cases like this, but I would prefer to use a library if one exists.

I don't think something that handles all these problems exists if the format is really very arbitrary. It would probably be easiest to "mold" your input into a form that can be handled by Date.parse, Date.strptime, or another existing tool, even though that could mean quite a bit of work.
How many different formats are we talking about? Do any of them conflict? It seems like you could just gsub like so: input_string.gsub(/\bJA\b/i, 'January'). Is this part of an import routine, or are the users going to be typing in dates in different formats?
There's a related question here: Parse Italian Date with Ruby
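The gsub-then-strptime idea from the answers above can be sketched as follows. This is only a minimal illustration: the helper names are made up, and apart from JA and FE (which come from the question) every two-letter code in the table is an assumption that would have to be replaced with the codes your sources actually use.

```ruby
require 'date'

# Hypothetical two-letter month codes. Only JA and FE appear in the question;
# the rest are placeholders for illustration.
MONTH_CODES = {
  'JA' => 'Jan', 'FE' => 'Feb', 'MR' => 'Mar', 'AP' => 'Apr',
  'MY' => 'May', 'JN' => 'Jun', 'JL' => 'Jul', 'AU' => 'Aug',
  'SE' => 'Sep', 'OC' => 'Oct', 'NO' => 'Nov', 'DE' => 'Dec'
}.freeze

# Matches any of the codes as a whole word, case-insensitively.
MONTH_CODE_PATTERN = /\b(#{MONTH_CODES.keys.join('|')})\b/i

# Rewrites the custom month codes into abbreviations that strptime's %b
# understands, then parses strictly against the given format.
# Returns nil when the string does not match the format or is not a real date.
def parse_custom_date(input, format)
  normalized = input.gsub(MONTH_CODE_PATTERN) { |code| MONTH_CODES.fetch(code.upcase) }
  Date.strptime(normalized, format)
rescue ArgumentError
  nil
end

# Validation is just "did strict parsing succeed?".
def valid_custom_date?(input, format)
  !parse_custom_date(input, format).nil?
end
```

With this table, parse_custom_date('12-JA-2014', '%d-%b-%Y') yields 12 January 2014, while valid_custom_date?('31-FE-2014', '%d-%b-%Y') is false because 31 February does not exist. Conflicting codes across sources would need one table per source.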

Related

How to display the internationalization "second"/"seconds" string for a number?

I am using Ruby on Rails 4 and, given a number, I would like to display the internationalized "second"/"seconds" string for that number. That is, I have a number (for example, 1 or 20) and I would like to display 1 second or 20 seconds (in English).
I know the date helpers, but no method seems to fit my case. How can I do that?
The usual t function eventually ends up inside the i18n gem's translate method. translate, like any sensible i18n/l10n tool, already knows about the current locale's pluralization rules. That means that you should just tell the translation system which message/string you want and how many of them you have, something like:
t('message-identifier', :count => n)
Then t will use the appropriate pluralization rules for n things in the current locale.
I use gettext for all my translation needs and it behaves this way. But there's no possible way that t wouldn't work this way too; it must work this way or it is utterly useless.

Time of Day in the JSON response model?

I am using ASP.NET Web Api 2 with Json.NET 6.0.1.
According to ISO 8601, dates should be interchanged in a certain way. I am using the IsoDateTimeConverter() in order to achieve this:
config.Formatters.JsonFormatter.SerializerSettings.Converters.Add(new IsoDateTimeConverter());
But how should "time of day" be returned in a JSON response model?
I cannot find anything for this in the ISO specification.
Should time perhaps be returned as a:
TimeSpan? (with expectation of the user to not use this as a duration representation)
DateTime? (with expectation of the user to drop off the date part)
A custom Time class
There is no standard structure in JSON for containing dates or times (see JSON.org). The de facto standard for date-time values is a string in ISO 8601 format, as you mentioned. But since there is no official standard, it really comes down to what works best for you and the consumers of your API.
Using a DateTime object is a reasonable choice because the support already exists in Json.Net and other serializers for converting these to and from ISO 8601 strings. So this would be the easiest to implement. However, users of your API would have to know to disregard the date portion, as you said. You could set the date to 0001-01-01 to emphasize its irrelevance. This isn't so different from the more common situation where you need only a date in your API and the time doesn't matter. Most people just set the time to midnight in this case and let it go. But, I would agree that this approach does seem to have a little bit of a "code smell" to it, given that part of the value is just noise.
Perhaps a cleaner idea is to format your DateTime value as ISO 8601, but then chop off the date portion before returning it. So users of the API would get a string that looks like 14:35:28.906Z. You could write a simple JsonConverter to handle this for you during serialization. This would sort of give you the best of both worlds -- a cleaner API, but you still can work with the familiar DateTime struct internally.
A custom Time class could also work here, but might be overkill, depending. If you do need to go there, you might want to look into a third-party library such as Noda Time, which has classes already built for these kinds of things, and also has pre-built converters for Json.Net.
I would definitely not choose TimeSpan for this purpose. Wrong tool for the job.
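To make the time-only string suggested above concrete, here is a small sketch of the wire format. It is written in Ruby purely to illustrate the shape of the value and is not a Json.NET converter; the helper name time_of_day_string is made up.

```ruby
# Formats only the time portion of a timestamp as an ISO 8601 style string
# in UTC with millisecond precision, e.g. "14:35:28.906Z".
# getutc returns a new Time, so the caller's object is not modified.
def time_of_day_string(time)
  time.getutc.strftime('%H:%M:%S.%LZ')
end
```

For example, time_of_day_string(Time.utc(2014, 3, 1, 14, 35, 28, 906_000)) returns "14:35:28.906Z". The date part is dropped entirely, so consumers never see a placeholder date.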

How to use package:intl

I want to use package:intl to build a multi-language HTML page.
I have seen example/basic_example.dart, but I can't find documentation for message_lookup_by_library.dart or intl_helpers.
Is there a simple example that uses package:intl?
Parts of that are still work in progress, e.g. there should be an update so that plurals and genders work in the next few days. To see the basic workflow, take a look at the intl/test/message_extraction directory and specifically message_extraction_test.dart. That does a round-trip extraction, translation, code generation, and running the result. The end result is roughly what you see in basic_example.dart, but it actually does the intermediate steps to produce that.
That test uses a trivial JSON format for output and the translations are hard-coded. You could manually translate things using that format, but for real usage you would probably want to use a real translation tool and a format that it understood. There is a little bit of documentation at http://api.dartlang.org/docs/releases/latest/intl/Intl.html but for the time being you're probably stuck looking at the code and/or asking questions.

Profanity filter import

I am looking to write a basic profanity filter in a Rails-based application. This will use a simple search-and-replace mechanism whenever the appropriate attribute gets submitted by a user. My question is, for those who have written these before, is there a CSV file or some database out there from which a list of profanity words can be imported into my database? We will supply the words that replace the profanities ourselves. We more or less need a database of profanities, racial slurs and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc). CleanSpeak is capable of filtering 20,000 messages per second on a low end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of on-going development though.
There are a few things I tell everyone that is looking to try and tackle a language filter.
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle these. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = v - those are slashes). There are a lot of Unicode characters that look like English characters and you will want to handle those appropriately.
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
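The Scunthorpe/clbuttic problem from the list above is easy to reproduce. The sketch below is not how CleanSpeak works; it only contrasts a naive substring replacement with a whole-word match, using a deliberately mild, hypothetical one-word block list.

```ruby
# Hypothetical block list, kept mild on purpose.
BLOCKED = %w[hell].freeze

# Naive approach: masks the blocked word wherever it occurs, even inside
# innocent words. This is what produces "clbuttic"-style mistakes.
def naive_filter(text)
  BLOCKED.reduce(text) do |out, word|
    out.gsub(/#{Regexp.escape(word)}/i) { |m| '*' * m.length }
  end
end

# Whole-word approach: only masks the blocked word when it stands alone.
def whole_word_filter(text)
  BLOCKED.reduce(text) do |out, word|
    out.gsub(/\b#{Regexp.escape(word)}\b/i) { |m| '*' * m.length }
  end
end
```

naive_filter('Hello from Shell') mangles both innocent words, while whole_word_filter leaves them alone and still masks the word in 'what the hell'. The whole-word version is in turn defeated by spacing and punctuation tricks such as 'h e l l' or 'h.e.l.l', which is why the points above about whitespace and punctuation matter.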
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience, you do understand that it's an exercise in futility?
If someone wants to inject profanity, there's a slew of words that are innocent in one context, and profane in another so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And, there are myriads of ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, s could be 5. And, that's a very, very, short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regex::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
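To get a feel for how fast such a list grows, here is a minimal sketch of a variation generator. It is not the code from that project; it uses only the substitutions named above (a/4, o/0/p, e/3, s/5), and the helper name is made up.

```ruby
# Substitutions taken from the example above; a real table would be far larger.
LEET = {
  'a' => %w[a 4],
  'o' => %w[o 0 p],
  'e' => %w[e 3],
  's' => %w[s 5]
}.freeze

# Returns every spelling of word obtained by independently choosing one
# alternative per letter (the Cartesian product of the per-letter choices).
def leet_variations(word)
  return [] if word.empty?

  choices = word.downcase.chars.map { |ch| LEET.fetch(ch, [ch]) }
  first, *rest = choices
  first.product(*rest).map(&:join)
end
```

Even with this tiny table, a three-letter word such as 'sea' has 2 x 2 x 2 = 8 spellings and 'goose' has 1 x 3 x 3 x 2 x 2 = 36. The count multiplies with every substitutable letter, so a list of a few thousand words quickly runs into the millions.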
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved, and low-likelihood of accuracy and success, especially over the long-term, or the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb

Will ANTLR Help? Different Suggestion?

Before I dive into ANTLR (because it is apparently not for the faint of heart), I just want to make sure I have made the right decision regarding its usage.
I want to create a grammar that will parse in a text file with predefined tags so that I can populate values within my application. (The text file is generated by another application.) So, essentially, I want to be able to parse something like this:
Name: TheFileName
Values: 5 3 1 6 1 3
Other Values: 5 3 1 5 1
In my application, TheFileName is stored as a String, and both sets of values are stored to an array. (This is just a sample, the file is much more complicated.) Anyway, am I at least going down the right path with ANTLR? Any other suggestions?
Edit
The files are created by the user and they define the areas via tags. So, it might look something like this.
Name: <string>TheFileName</string>
Values: <array>5 3 1 6 1 3</array>
Important Value: <double>3.45</double>
Something along those lines.
The basic question is how is the file more complicated? Is it basically more of the same, with a tag, a colon and one or more values, or is the basic structure of the other lines more complex? If it's basically just more of the same, code to recognize and read the data is pretty trivial, and a parser generator isn't likely to gain much. If the other lines have substantially different structure, it'll depend primarily on how they differ.
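For the simple "tag, colon, values" case, the reading code really is short. The following is a minimal sketch in Ruby, assuming one entry per line and no colons inside tag names; the function name is made up.

```ruby
# Reads "Tag: value" lines into a hash of tag => raw value string.
# Blank lines and lines without a colon are skipped.
def parse_tagged_lines(text)
  text.each_line.with_object({}) do |line, result|
    tag, value = line.strip.split(/:\s*/, 2)
    next if tag.nil? || value.nil?

    result[tag] = value
  end
end

sample = <<~FILE
  Name: TheFileName
  Values: 5 3 1 6 1 3
  Other Values: 5 3 1 5 1
FILE

parsed = parse_tagged_lines(sample)
name   = parsed['Name']                       # kept as a String
values = parsed['Values'].split.map(&:to_i)   # whitespace-separated integers
```

Anything beyond this (nesting, quoting, multi-line values) is where a parser generator starts to pay for itself.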
Edit: Based on what you've added, I'd go one (tiny) step further, and format your file as XML. You can then use existing XML parsers (and such) to read the files, extract data, verify that they fit a specified format, etc.
It depends on what control you have over the format of the file you are parsing. If you have no control then a parser-generator such as ANTLR may be valuable. (We do this ourselves for FORTRAN output files over which we have no control). It's quite a bit of work but we have now mastered the basic ANTLR lexer/parser strategy and it's starting to work well.
If, however, you have some or complete control over the format then create it with as much markup as necessary. I would always create such a file in XML as there are so many tools for processing it (not only the parsing, but also XPath, databases, etc.) In general we use ANTLR to parse semi-structured information into XML.
If you don't need for the format to be custom-built, then you should look into using an existing format such as JSON or XML, for which there are parsers available.
Even if you do need a custom format, you may be better off designing one that is dirt simple so that you don't need a full-blown grammar to parse it. Designing your own scripting grammar from scratch and doing a good job of it is a lot of work.
Writing grammar parsers can also be really fun, so if you're curious then you should go for it. But I don't recommend carelessly mixing learning exercises with practical work code.
Well, if it's "much more complicated", then, yes, a parser generator would be helpful. But, since you don't show the actual format of your file, how could anybody know what might be the right tool for the job?
I use the free GOLD Parser Builder, which is incredibly easy to use, and can generate the parser itself in many different languages. There are samples for parsing such expressions also.
If the format of the file is up to the user can you even define a grammar for it?
Seems like you just want a lexer at best. Using ANTLR just for the lexer part is possible, but would seem like overkill.
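If the typed tags from the question's edit are all that is needed, a single regular expression per line may be enough. This sketch assumes exactly one Tag: <type>value</type> entry per line and only the three types shown in the question; the converter table and function name are hypothetical.

```ruby
# One "Tag: <type>value</type>" entry per line. The closing tag must repeat
# the opening type (enforced with a named backreference).
TYPED_LINE = %r{\A(?<tag>[^:]+):\s*<(?<type>\w+)>(?<value>.*)</\k<type>>\s*\z}

# How to turn the raw text into a Ruby value for each known type.
CONVERTERS = {
  'string' => ->(v) { v },
  'double' => ->(v) { Float(v) },
  'array'  => ->(v) { v.split.map { |n| Integer(n) } }
}.freeze

# Returns [tag, converted_value], or nil when the line does not match the
# expected shape or uses an unknown type.
def parse_typed_line(line)
  match = TYPED_LINE.match(line.strip)
  return nil unless match

  converter = CONVERTERS[match[:type]]
  return nil unless converter

  [match[:tag].strip, converter.call(match[:value])]
end
```

For instance, parse_typed_line('Important Value: <double>3.45</double>') returns ['Important Value', 3.45]. Once tags can nest or span lines, switching to real XML or a generated parser is the more robust choice.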
