I did some searching but haven't found anything that looks useful yet, so I am wondering if anyone knows of something (a tool, library, etc.) that can parse English phrases and translate them into a cron string.
For example: Every Tuesday at 15:00 converts to 0 15 * * 2
It seems like something that would have lots of gotchas, and it would be preferable to benefit from someone else's work. You see this in a few nice sites/apps that can work out what you mean from a simple phrase rather than presenting some hideous user interface.
Thanks in advance.
Though this is an old question, I would like to list all the libraries/tools that I know of so far, so that this answer might help others who arrive on this page looking for the same thing:
JavaScript:
natural-cron.js (link)
friendly-cron (link)
PHP:
natural-cron-expression (link)
Ruby:
whenever (link)
Feel free to reply in the comments if you know of any other library that is not listed here :)
(Full disclosure: natural-cron.js has been developed by me & my friend, when no other library satisfied the needs of our project)
For Ruby there's "Whenever", which might provide a starting point: it translates quasi-English (actually it's valid Ruby) into cron strings.
Depending on how flexible you need it to be, and how willing to roll up your own sleeves you are, you could define a simple grammar for this.
Every would be a quantifier. You may need others but I can't think of any. Valid syntax might be:
Every (day-spec) AT (time)
Where day-spec could be a literal day (e.g. Monday), a day of the month (e.g. 30th day), or some other syntax (I'd suggest fortnights, but I'm not sure cron can represent those well).
Time could be specified using either 24 hour (16:00) or 12 hour (4:00pm) format.
Another syntax that you might want is:
Every (frequency) From (time), where frequency is basically (quantity) (unit) (e.g. 10 Minutes). The From time enables you to set an offset (e.g. Every 30 Minutes From 01:10am).
You'd probably need to sit down and figure out these details a bit more, but a rigid grammar could be implemented relatively easily using recursive descent.
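As a starting point, here is a minimal sketch in Ruby of just the "Every (day-spec) AT (time)" rule; the day table, error handling, and time format are simplified for illustration:

# Minimal sketch: parse "Every <day> at <HH:MM>" into a cron string.
# The grammar and vocabulary here are illustrative, not a full implementation.
DAYS = {
  "sunday" => 0, "monday" => 1, "tuesday" => 2, "wednesday" => 3,
  "thursday" => 4, "friday" => 5, "saturday" => 6
}

def phrase_to_cron(phrase)
  tokens = phrase.downcase.split
  raise ArgumentError, "expected 'every'" unless tokens.shift == "every"
  day = DAYS[tokens.shift] or raise ArgumentError, "unknown day"
  raise ArgumentError, "expected 'at'" unless tokens.shift == "at"
  hour, minute = tokens.shift.split(":").map(&:to_i)
  "#{minute} #{hour} * * #{day}"
end

phrase_to_cron("Every Tuesday at 15:00")   # => "0 15 * * 2"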
Hmm, about those gotchas... How about also writing one that translates the cron params back to English? That way you can see if the parser "understood" you.
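For the simple weekly case, that reverse translation is only a few lines. A sketch in the same spirit, handling only the day-of-week form:

# Sketch: translate a simple "m h * * dow" cron string back to English,
# so the user can confirm the parser understood them.
DAY_NAMES = %w[Sunday Monday Tuesday Wednesday Thursday Friday Saturday]

def cron_to_phrase(cron)
  minute, hour, _dom, _month, dow = cron.split
  return "unsupported expression" unless dow =~ /\A[0-6]\z/
  format("Every %s at %02d:%02d", DAY_NAMES[dow.to_i], hour.to_i, minute.to_i)
end

cron_to_phrase("0 15 * * 2")   # => "Every Tuesday at 15:00"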
I am looking to write a basic profanity filter in a Rails based application. This will use a simple search-and-replace mechanism whenever the appropriate attribute is submitted by a user. My question is, for those who have written one of these before: is there a CSV file or some database out there with a list of profanity words that can be imported into my database? We are supplying the replacement words ourselves. We more or less need a database of profanities, racial slurs, and anything that's not exactly rated PG-13 to get triggered.
As the Tin Man suggested, this problem is difficult, but it isn't impossible. I've built a commercial profanity filter named CleanSpeak that handles everything mentioned above (leet speak, phonetics, language rules, whitelisting, etc.). CleanSpeak is capable of filtering 20,000 messages per second on a low-end server, so it is possible to build something that works well and performs well. I will mention that CleanSpeak is the result of about 3 years of ongoing development, though.
There are a few things I tell everyone who is looking to tackle a language filter:
Don't use regular expressions unless you have a small list and don't mind a lot of things getting through. Regular expressions are relatively slow overall and hard to manage.
Determine if you want to handle conjugations, inflections and other language rules. These often add a considerable amount of time to the project.
Decide what type of performance you need and whether or not you can make multiple passes on the String. The more passes you make, the slower your filter will be.
Understand the Scunthorpe and clbuttic problems and determine how you will handle them. This usually requires some form of language intelligence and whitelisting.
Realize that whitespace has a different meaning now. You can't use it as a word delimiter any more (b e c a u s e of this)
Be careful with your handling of punctuation because it can be used to get around the filter (l.i.k.e th---is)
Understand how people use ASCII art and Unicode to replace characters (\/ = v; those are slashes). There are a lot of Unicode characters that look like English characters, and you will want to handle those appropriately (see the normalization sketch after this list).
Understand that people make up new profanity all the time by smashing words together (likethis) and figure out if you want to handle that.
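To illustrate the whitespace, punctuation, and character-substitution points above, a crude normalization pass might look like the following. The substitution table is a tiny invented sample and nowhere near a complete solution:

# Crude sketch: normalize a message before matching it against a word list.
# Handles only a tiny sample of substitutions and will still miss a lot.
SUBSTITUTIONS = { "4" => "a", "@" => "a", "3" => "e", "0" => "o", "5" => "s", "1" => "i" }

def normalize(text)
  t = text.downcase
  SUBSTITUTIONS.each { |from, to| t = t.tr(from, to) }       # undo simple character swaps first
  t = t.gsub(/[^a-z\s]/, "")                                 # strip punctuation: "l.i.k.e" -> "like"
  t.gsub(/(?:\b\w\b[ ]*){3,}/) { |run| run.delete(" ") }     # rejoin "b e c a u s e"
end

normalize("b 3 c 4 u s e")   # => "because"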
You can search around StackOverflow for my comments on other threads as I might have more information on those threads that I've forgotten here.
Here's one you could use: Offensive/Profane Word List from CMU site
Based on personal experience: you do understand that it's an exercise in futility, right?
If someone wants to inject profanity, there's a slew of words that are innocent in one context and profane in another, so you'll have to write a context parser to avoid black-listing clean words. A quick glance at CMU's list shows words I'd never consider rude/crude/socially unacceptable. You'll see there are many words that could be proper names or nouns, countries, terms of endearment, etc. And there are myriad ways to throw your algorithm off using L33T speak and such. Search Wikipedia and the internets and you can build tables of variations of letters.
Look at CMU's list and imagine how long the list would be if, in addition to the correct letter, every a could also be 4, o could be 0 or p, e could be 3, and s could be 5. And that's a very, very short example.
I was asked to do a similar task and wrote code to generate L33T variations of the words, and generated a hit-list of words based on several profanity/offensive lists available on the internet. After running the generator, and being a little over 1/4 of the way through the file, I had over one million entries in my DB. I pulled the plug on the project at that point, because the time spent searching, even using Perl's Regexp::Assemble, was going to be ridiculous, especially since it'd still be so easy to fool.
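For a sense of how the list explodes, here is roughly that kind of generator, resketched in Ruby rather than the original Perl, using the short substitution example from above:

# Sketch: generate leet-speak variants of a word with a tiny substitution table.
# The real table was much larger, which is why the output exploded.
VARIANTS = { "a" => %w[a 4], "o" => %w[o 0 p], "e" => %w[e 3], "s" => %w[s 5] }

def leet_variants(word)
  word.chars
      .map { |c| VARIANTS.fetch(c, [c]) }
      .inject([""]) { |acc, choices| acc.product(choices).map(&:join) }
end

leet_variants("ass").length   # => 8 variants for a single three-letter word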
I recommend you have a long talk with whoever requested that, and ask if they understand the programming issues involved and the low likelihood of accuracy and success, especially over the long term, as well as the possible customer backlash when they realize you're censoring them.
I have one that I've added to (obfuscated a bit) but here it is: https://github.com/rdp/sensible-cinema/blob/master/lib/subtitle_profanity_finder.rb
I thought the Time object in Ruby on Rails stores the time, but when I ran Time.now.beginning_of_day it gave me the date as well. I'm just trying to capture the time and not the date at all. Is there a way to do this? Thanks
Yes, you have to use strftime.
Please see this link: http://www.wetware.co.nz/2009/07/rails-date-formats-strftime/ (Rails Date Formats – strftime)
and this link: http://apidock.com/ruby/DateTime/strftime
Example:
# in your case, this should do it
Time.now.beginning_of_day.strftime("%I:%M")
With a leading zero (capital I):
Time.now.strftime("%I:%M") # => 05:21
Without a leading zero (lowercase L):
Time.now.strftime("%l:%M") # => 5:21
Hope it helps.
UPDATE: to answer your question in the comment below ("Why is Time storing date information? Shouldn't it just be time?"):
Good point. As vladr mentioned above (and I am sure the people who created Rails were thinking along similar lines when they designed the Time object), when you are talking about time, it is inevitably tied to a time zone as well. In your case you want to use Time.now.beginning_of_day, right? I am in Thailand; suppose you are in America, and time in America is 12 hours behind Thailand. So when it is Tue 10:00 am here in Thailand, it is Mon 10:00 pm in America. Now call Time.now.beginning_of_day (or just Time.now for a clearer picture): what should the answer be? That is why I think the Rails team standardized on UTC+0. In your case I recommend using Time.zone.now. I also recently found a blog post on this very topic: http://www.elabs.se/blog/36-working-with-time-zones-in-ruby-on-rails. Hope it helps. If anyone reading this feels that something is missing, feel free to edit my answer. Thank you very much :)
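A small sketch of the difference, as run in a Rails console (the zone name is just an example; real apps set it in config/application.rb):

# Time.now uses the server's system clock and zone;
# Time.zone.now respects the Rails-configured time zone.
Time.zone = "Bangkok"               # example zone; normally set in config/application.rb
Time.now.strftime("%H:%M %Z")       # the server's local time
Time.zone.now.strftime("%H:%M %Z")  # the time in the configured Rails zone
Time.zone.now.beginning_of_day      # midnight in the configured zone, not the server's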
I'm thinking about a project which might use functionality similar to how "Quick Add" handles parsing natural language into something that can be understood with some level of semantics. I'm interested in understanding this better and wondered what your thoughts were on how it might be implemented.
If you're unfamiliar with what "Quick Add" is, check out Google's KB about it.
6/4/10 Update
Additional research on natural language processing (NLP) yields results which are MUCH broader than what I feel is actually implemented in something like "Quick Add". Given that this feature expects specific types of input rather than truly free-form text, I'm thinking it is a much narrower application of NLP. If anyone could suggest a narrower topic to research, rather than the entire breadth of NLP, it would be greatly appreciated.
That said, I've found a nice collection of resources about NLP including this great FAQ.
I would start by deciding on a standard way to represent all the information I'm interested in: event name, start/end time (and date), guest list, location. For example, I might use an XML notation like this:
<event>
  <name>meet Sam</name>
  <starttime>16:30 07/06/2010</starttime>
  <endtime>17:30 07/06/2010</endtime>
</event>
I'd then aim to build up a corpus of diary entries about dates, annotated with their XML forms. How would I collect the data? Well, if I was Google, I'd probably have all sorts of ways. Since I'm me, I'd probably start by writing down all the ways I could think of to express this sort of stuff, then annotating it by hand. If I could add to this by going through friends' e-mails and whatnot, so much the better.
Now that I've got a corpus, it can serve as a set of unit tests. I need to code a parser to fit the tests. The parser should translate a string of natural language into the logical form of my annotation. First, it should split the string into its constituent words. This is called tokenising, and there is off-the-shelf software available to do it. (For example, see NLTK.) To interpret the words, I would look for patterns in the data: for example, text following 'at' or 'in' should be tagged as a location; 'for X minutes' means I need to add that number of minutes to the start time to get the end time. Statistical methods would probably be overkill here; it's best to create a series of hand-coded rules that express your own knowledge of how to interpret the words, phrases and constructions in this domain.
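A toy version of such hand-coded rules might look like this in Ruby; the patterns and field names are invented for illustration and cover only a couple of constructions:

# Toy sketch of hand-coded rules for "Quick Add"-style parsing.
# Real input is far messier; this handles only a few simple patterns.
require "time"

def parse_entry(text)
  event = { name: text[/\A.+?(?= at | in | for |\z)/] }   # everything before the first rule hit
  event[:start] = Time.parse($1) if text =~ /\bat (\d{1,2}(?::\d{2})?\s*(?:am|pm)?)/i
  event[:end] = event[:start] + $1.to_i * 60 if text =~ /\bfor (\d+) minutes\b/i && event[:start]
  event[:location] = $1 if text =~ /\bin ([A-Z]\w+)/      # text after 'in' tagged as location
  event
end

parse_entry("meet Sam at 4:30pm for 60 minutes")
# => { :name => "meet Sam", :start => ...today 16:30..., :end => ...today 17:30... }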
It would seem that there's really no narrow approach to this problem. I wanted to avoid having to pull along the entirety of NLP to figure out a solution, but I haven't found any alternative. I'll update this if I find a really great solution later.
I have a requirement to handle custom date formats in an existing app. The idea is that the users have to deal with multiple formats from outside sources they have very little control over. We will need to be able to take the format and both validate dates against it and parse strings specifically in that format. The other thing is that these can be completely arbitrary, like JA == January, FE == February, etc...
To my understanding, chronic only handles parsing (and does it in a more magical way than I can use), and DateTime#strptime comes close but doesn't really handle the whole two-character month scenario, even with custom formatters. The 'nuclear' option is to write in custom support for edge cases like this, but I would prefer to use a library if something like this exists.
I don't think something that handles all these problems exists if the format is really very arbitrary. It would probably be easiest to "mold" your input into a form that can be handled by Date.parse, Date.strptime, or another existing tool, even though that could mean quite a bit of work.
How many different formats are we talking about? Do any of them conflict? It seems like you could just gsub like so: input_string.gsub(/\bJA\b/i, 'January'). Is this part of an import routine, or are the users going to be typing in dates in different formats?
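For example, something like the following, assuming you can build a table of the custom abbreviations up front (only JA and FE come from the question, and the format string is a guess at the source format):

require "date"

# Sketch: normalize custom two-letter months into something Date.strptime
# understands, then parse as usual.
MONTHS = { "JA" => "Jan", "FE" => "Feb" }   # ...extend with the rest of the real mapping

def parse_custom(str)
  normalized = str.gsub(/\b(#{MONTHS.keys.join("|")})\b/i) { MONTHS[$1.upcase] }
  Date.strptime(normalized, "%d %b %Y")     # format string is a guess at the source format
end

parse_custom("15 JA 2013")   # => #<Date: 2013-01-15>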
There's a related question here: Parse Italian Date with Ruby
I need to parse recipe ingredients into amount, measurement, item, and description as applicable to the line, such as "1 cup flour", "the peel of 2 lemons", "1 cup packed brown sugar", etc. What would be the best way of doing this? I am interested in using Python for the project, so I am assuming the NLTK is the best bet, but I am open to other languages.
I actually do this for my website, which is now part of an open source project for others to use.
I wrote a blog post on my techniques, enjoy!
http://blog.kitchenpc.com/2011/07/06/chef-watson/
The New York Times faced this problem when they were parsing their recipe archive. They used an NLP technique called linear-chain conditional random fields (CRF). This blog post provides a good overview:
"Extracting Structured Data From Recipes Using Conditional Random Fields"
They open-sourced their code, but quickly abandoned it. I maintain the most up-to-date version of it and I wrote a bit about how I modernized it.
If you're looking for a ready-made solution, several companies offer ingredient parsing as a service:
Zestful (full disclosure: I'm the author)
Spoonacular
Edamam
I guess this is a few years out, but I was thinking of doing something similar myself and came across this, so I thought I might have a stab at it in case it is useful to anyone else in future.
Even though you say you want to parse free text, most recipes have a pretty standard format for their ingredient lists: each ingredient is on a separate line, and exact sentence structure is rarely all that important. The range of vocabulary is relatively small as well.
One way might be to check each line for words which might be nouns and words/symbols which express quantities. I think WordNet may help with seeing if a word is likely to be a noun or not, but I've not used it before myself. Alternatively, you could use http://en.wikibooks.org/wiki/Cookbook:Ingredients as a word list, though again, I wouldn't know exactly how comprehensive it is.
The other part is to recognise quantities. These come in a few different forms, but few enough that you could probably create a list of keywords. In particular, make sure you have good error reporting. If the program can't fully parse a line, get it to report back to you what that line is, along with what it has/hasn't recognised so you can adjust your keyword lists accordingly.
Aaanyway, I'm not guaranteeing any of this will work (and it's almost certain not to be 100% reliable), but that's how I'd start to approach the problem.
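For what it's worth, a first pass at that keyword-list-plus-error-reporting idea might look like this; the unit list is a small sample:

# Sketch: keyword-list approach for recipe lines, with error reporting so you
# can grow the keyword lists from the lines that fail to parse.
UNITS = %w[cup cups tbsp tsp teaspoon tablespoon g kg oz lb ml l pinch dash]

def parse_line(line)
  tokens = line.downcase.split
  amount = tokens.shift if tokens.first =~ %r{\A[\d/.]+\z}
  unit   = tokens.shift if UNITS.include?(tokens.first)
  if amount.nil? && unit.nil?
    warn "could not parse: #{line.inspect}"   # surface lines that need new keywords
    return nil
  end
  { amount: amount, unit: unit, item: tokens.join(" ") }
end

parse_line("1 cup packed brown sugar")   # => {:amount=>"1", :unit=>"cup", :item=>"packed brown sugar"}
parse_line("the peel of 2 lemons")       # warns and returns nil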
This is an incomplete answer, but you're looking at writing up a free-text parser, which as you know, is non-trivial :)
Some ways to cheat, using knowledge specific to cooking:
Construct lists of words for the "adjectives" and "verbs", and filter against them
measurement units form a closed set, using words and abbreviations like {L., c, cup, t, dash}
instructions -- cut, dice, cook, peel. Things that come after this are almost certain to be ingredients
Remember that you're mostly looking for nouns, and you can take a labeled list of non-nouns (from WordNet, for example) and filter against them.
If you're more ambitious, you can look in the NLTK Book at the chapter on parsers.
Good luck! This sounds like a mostly doable project!
Can you be more specific about what your input is? If you just have input like this:
1 cup flour
2 lemon peels
1 cup packed brown sugar
It won't be too hard to parse it without using any NLP at all.
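For input that regular, a single pattern gets you most of the way. A sketch (telling units from nouns in the middle slot would still need a word list, as other answers note):

# Sketch: for lines as regular as the examples, one pattern suffices.
LINE = %r{\A(\d+(?:/\d+)?)\s+(\w+)\s+(.+)\z}

["1 cup flour", "2 lemon peels", "1 cup packed brown sugar"].map do |line|
  amount, middle, rest = line.match(LINE).captures
  { amount: amount, middle: middle, rest: rest }   # "middle" may be a unit ("cup") or a noun ("lemon")
end
# => [{:amount=>"1", :middle=>"cup", :rest=>"flour"}, ...]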