Autoformat Text with Machine Learning

I am currently working on an issue regarding optimizing the workflow of an agency.
The agency receives around 30-40 PDF/Word documents per week, which should be converted into InDesign files that will be printed in newspapers. It's always the same pattern: job adverts with a logo, the job position and some text.
Every week the same customers send us their adverts. Our employees usually take the patterns of the existing files and copy-paste the new text.
We apply some fixed formatting rules, for example: no words breaking across lines, a set distance between the job title and the first paragraph. One important thing is to keep the height as small as possible in order to reduce costs for our clients. Because many of our employees are new, work part-time, etc., we face high staff turnover; therefore we want to standardize the process, so that only small changes are needed for new adverts. I guess you know what I mean.
Do you see a possibility to improve the process, for example with NLTK? I am thinking of training an algorithm which recognizes the "job title", "bullet points", logo, etc. and automatically proposes a formatting for the text.
A colleague told me to just write a script which formats the InDesign document.
What do you think? Thanks so far.
Here is a brief example: [example picture]
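To make the idea concrete, here is a minimal Python sketch of the line-classification step, assuming the PDF/Word text has already been extracted to plain-text lines. The labels and heuristics are invented for illustration; a trained classifier (e.g. NLTK or scikit-learn over similar features) could replace them.

# Minimal sketch: classify extracted lines into roles before placing them
# into an InDesign template. Labels and heuristics are illustrative only.

def classify_line(line: str, is_first: bool) -> str:
    stripped = line.strip()
    if not stripped:
        return "blank"
    if stripped.startswith(("-", "*", "\u2022")):
        return "bullet"
    # A short first line is a reasonable guess for the job title.
    if is_first and len(stripped.split()) <= 8:
        return "job_title"
    return "body"

def classify_advert(text: str) -> list[tuple[str, str]]:
    return [(classify_line(l, i == 0), l) for i, l in enumerate(text.splitlines())]

sample = "Senior Accountant (m/f)\nWe offer:\n- flexible hours\n- 30 days holiday"
for role, line in classify_advert(sample):
    print(role, "|", line)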

Related

How to generate questions based on information given in a question or by combining multiple questions?

I have a dataset of questions and I want to generate new questions from the given text. For example, if I have a task like "John paid $1500 for a television set. The television set has a 25% discount. Calculate the original price of the television set.", I want to generate something like "Calculate the discount given for the television set".
An alternative version is to permute multiple questions. I have multiple questions with the same information given but different questions asked (for example, all three questions share information about the ratio in which the object was shared, but ask different questions).
I was looking at the text-paraphrase task, but it doesn't work for me, because I want to generate new questions rather than paraphrase existing ones.
Also, I was thinking about storing the questions in the format of a knowledge graph and somehow extracting the text from it for text generation.
It would be great to know whether there is a similar problem in this domain and what a problem like this is called.
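If the slots (item, price, discount) can be extracted first, one simple baseline is template filling rather than free-form generation. A minimal sketch; the slot names and templates below are invented for illustration:

# Illustrative template-based question generation: the slots and templates
# are made up for this example, not a general solution.

TEMPLATES = [
    "Calculate the original price of the {item}.",
    "Calculate the discount given for the {item}.",
    "How much did the {item} cost after the {discount}% discount?",
]

def generate_questions(facts: dict) -> list[str]:
    return [t.format(**facts) for t in TEMPLATES]

print(generate_questions({"item": "television set", "discount": 25}))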

Is there a way to have a formula or script pick a number of pre-set lengths to cover an area?

Apologies if the title isn't very clear.
What I am trying to do is get a Google Sheet to automatically calculate how many lengths of a material I will need to cover an area, hopefully including a mix if needed. There are three different lengths of material that never change, but the total area I need to cover changes on a case-by-case basis. It is only a straight line, so there is no need to worry about width or height.
The data breaks down as follows:
Pre-set lengths to choose from
10'6"
12'6"
14'6"
Length of area I need to cover only comes in inches (e.g. 68 1/2"; 70"; 59")
The only thing I have been successful in doing is getting the length I need to cover and then manually picking out how many pieces of each length I need, but I cannot think of any way to have a formula or script optimize how many of each piece I need. I can understand formulas well enough, but once scripting comes into play I start getting lost. I believe this issue may be beyond the capabilities of formulas.
This is an interesting problem - I don't have the 'reputation' required to comment, but to be clear: you're actually trying to find the 'best fit' of the available lengths to cover the required length?
If that's the case then yes, you're not going to get there without scripting. Fortunately, there are other folks who have this problem and have solved it... you could look at this online cut-list calculator for an example. I think that one even includes an embeddable script for your sheets.
If you're looking to solve the problem yourself because it's interesting, googling 'optimal cut list' or the like will turn up references. Usually you're optimizing on two variables (e.g. 'fewest joins' and 'least waste'), which tips you over into the world of linear programming (only just...) if you want to go there. If it were me, I'd just dig up a few example scripts and map how they operate back to a theoretical description (e.g. this wiki article.)
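To make the optimization concrete, here is a brute-force sketch in Python (the same logic could be ported to Apps Script for use inside the sheet). The preset lengths are converted to inches, and breaking ties on fewest pieces is just one possible policy:

# Brute-force cut-list sketch: minimise waste, then number of pieces.
from itertools import product

PIECES = {"10'6\"": 126, "12'6\"": 150, "14'6\"": 174}  # lengths in inches

def best_mix(target_inches: float):
    names, sizes = list(PIECES), list(PIECES.values())
    max_counts = [int(target_inches // s) + 1 for s in sizes]
    best = None
    for counts in product(*(range(m + 1) for m in max_counts)):
        total = sum(c * s for c, s in zip(counts, sizes))
        if total < target_inches:
            continue  # this mix does not cover the whole run
        waste, pieces = total - target_inches, sum(counts)
        if best is None or (waste, pieces) < best[:2]:
            best = (waste, pieces, dict(zip(names, counts)))
    return best

print(best_mix(68.5))  # one 10'6" piece covers it, with 57.5" of waste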

Text bleeding out of the page in knitr using LaTeX

I am using the following code snippet to write a data frame of text as a table.
temp<-c("A white paper is a document that describes a given problem and proposes a specific solution to the problem.", "Originally used to describe government policy, white papers are most common today in corporate settings.", "A typical white paper might list ways to meet a client's marketing needs, suggest the use of a certain product for a technical process, or identify ways to streamline internal communication.")
myTable<-as.data.frame(temp)
myTable<-print.xtable(xtable(textData,caption="Some Text Data"),caption.placement="top",print.results="F", tabular.environment="tabularx", width="\\textwidth")
The table border is limited to the page, but the text still bleeds out. How do I get the text to come within the table limits?

Profanity checking for promotional codes

I have a slightly unusual profanity-related question.
Now we're used to dealing with profanity-filtering of user-generated content — any method is imperfect, but products like CleanSpeak and WebPurify do a good-enough job.
The problem we have at the moment, though, is that we've been building an engine to run promotional-code–based competitions that will be used internationally. We could do with checking that none of these codes is profane in Latin American Spanish or Malay (at least in the first instance), to make sure we don't send out a code that's equivalent to FUCK23 or PEN15 or something.
We've tried Googling around and asking people we know, but we can't find an easy way of getting hold of an es-419 or an ms profanity list to filter the codes against. As there are literally millions of codes per locale, we'd rather do an offline check than hit an API for each code (which would be expensive both in terms of bandwidth and usage fees).
I know this is a bit of a long shot, but does anyone know of a good source for profanity lists in different languages?
#disclaim: We know that no profanity filtering is perfect, that it's essentially futile with user-generated content and we have read SO #273516: How do you implement a good profanity filter? — that's not what we're asking.
Building or finding lists in other languages is extremely time-consuming and difficult (trust me, we've built many of them at Inversoft). You might be better off tweaking the code generator instead (from what I can tell, your codes are generated programmatically rather than written by humans).
The best way to tweak a generator is to ensure that the codes can't easily form words based on the general use of consonants and vowels in most European languages. Things get a bit dicey in Polish and others, but it usually works.
Generally, most codes that start with a vowel are followed by another vowel or a non-joining consonant (like 'q' without a 'u'). If the code starts with a consonant then the next character is the same consonant or one that has a low probability of being used. For example, if you start with 's' then adding 'g' is a good choice.
You could also use wiktionary or other similar sources (like Linux dictionary files) to build a statistical approach to this. By extracting the probability of characters being next to each other, you should be able to generate codes with good accuracy of never being words in any language.
However, if I misread your question and you aren't generating the codes programmatically, you can ignore my response completely. :)
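As a concrete (and deliberately simplified) sketch of that statistical idea in Python: learn which letter pairs are common in real words, then reject generated codes that contain word-like pairs. The three-word list is a stand-in; in practice you would feed in a Wiktionary dump or a system dictionary file:

# Reject codes containing letter pairs that are common in real words.
import random
import string
from collections import Counter

def bigram_counts(words):
    pairs = Counter()
    for w in words:
        w = w.lower()
        pairs.update(w[i:i + 2] for i in range(len(w) - 1))
    return pairs

def looks_wordlike(code, common_pairs):
    code = code.lower()
    return any(code[i:i + 2] in common_pairs for i in range(len(code) - 1))

def generate_code(length, common_pairs):
    while True:
        code = "".join(random.choice(string.ascii_uppercase) for _ in range(length))
        if not looks_wordlike(code, common_pairs):
            return code

words = ["pen", "example", "paper"]  # stand-in for a real dictionary
common = set(bigram_counts(words))
print(generate_code(6, common))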
I have had the same thoughts while trying to generate 6-character codes for a project I am doing.
I decided to reduce the likelihood of obviously profane codes, so I removed the vowels that I found in as many "bad" words as I could think of from my initial base-36 generation code, leaving me with roughly a base-29 system that does not include a, e, i, o, u, 1 or 0. The one and zero were removed to reduce confusion with I, L and O in some fonts.
So far I have not seen a "profane" code generated, and base 29 still gives 29^6 ≈ 595 million unique 6-character combinations.
I cannot vouch for other languages, and had not even considered them...
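A minimal sketch of that reduced-alphabet generator in Python, using exactly the exclusions described above:

# Drop vowels plus the easily confused 0/1 from the code alphabet.
import secrets
import string

ALPHABET = "".join(c for c in string.ascii_uppercase + string.digits
                   if c not in "AEIOU01")  # 29 symbols remain

def make_code(length=6):
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(make_code())  # e.g. 'Q7ZKXW'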

NLP for extracting actions from text

I'm hoping somebody can point me in the right direction to learn about separating out actions from a bunch of text.
Suppose I have this text
Drop off the dry cleaning, and go to the corner store and pick-up a jug of milk and get a pint of strawberries.
Then, go pick up the kids from school. First, get John who is in the daycare next to the library, and then get Sam who is two blocks away.
By the time you've got the kids, you'll need to stop by the doctor's office for the prescription. Tim's flight arrives at 4pm.
It's American Airlines flight 331 arriving from Dallas. It will be getting close to rush hour, so make sure you leave yourself enough time.
I'm trying to have it split up into
Drop off the dry cleaning,
and go to the corner store and pick-up a jug of milk and get a pint of strawberries.
Then, go pick up the kids from school. First, get John who is in the daycare next to the library, and then get Sam who is two blocks away.
By the time you've got the kids, you'll need to stop by the doctor's office for the prescription.
Tim's flight arrives at 4pm.
It's American Airlines flight 331 arriving from Dallas. It will be getting close to rush hour, so make sure you leave yourself enough time.
I haven't been able to find anything in my searches that is specifically action-based. It would need to be smarter than just picking out verbs, as multiple verbs are sometimes associated with one action; for instance, the second item has 'go', 'pick-up' and 'get', but that is all part of a single action. Of course, "Tim's flight" only suggests an action via the present participle, with the verb coming toward the end of the segment.
Any suggestions on where to look to do this kind of thing? Things to watch out for, recommended readings, etc.?
Simple approach: parse the text using [your favorite parser], then select the sentences or SBAR phrases that are in the imperative mood. The Stanford Parser just so happens to have "Improved recognition of imperatives" in its very latest release.
There's probably no need for machine learning beyond what is already incorporated in standard parser programs.
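For instance, here is a rough sketch of that heuristic in Python with spaCy's dependency parse (instead of the Stanford Parser). Treating a sentence whose root verb is in base form with no explicit subject as imperative is an approximation, not an exact test of mood.

# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def imperative_sentences(text):
    picked = []
    for sent in nlp(text).sents:
        root = sent.root
        has_subject = any(t.dep_ in ("nsubj", "nsubjpass") for t in root.children)
        if root.tag_ == "VB" and not has_subject:  # bare verb, no subject
            picked.append(sent.text)
    return picked

print(imperative_sentences("Drop off the dry cleaning. Tim's flight arrives at 4pm."))
# -> ["Drop off the dry cleaning."]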
This domain is called Information Extraction.
The general approach to sentence understanding is either:
extract a Part-of-Speech-tagged parse tree (Python spaCy.io, NLTK, CoreNLP, etc.)
extract a word vector (e.g. word2vec); see the sketch below
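A small illustration of both bullet points with spaCy (NLTK or CoreNLP expose similar information). Note that meaningful word vectors require a model that ships with them, such as en_core_web_md; this is a sketch, not a recommendation of one library over another.

# pip install spacy && python -m spacy download en_core_web_md
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("Drop off the dry cleaning.")

# 1. Part-of-speech tags and dependency arcs (the parse-tree view)
for token in doc:
    print(token.text, token.pos_, token.dep_, token.head.text)

# 2. A word vector (the word2vec-style view)
print(doc[0].vector[:5])  # first few dimensions of the vector for "Drop"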
