Find similar files in a repository - machine-learning

I have a repository of files. The files are plain English text written by humans, and each file contains a few paragraphs describing some incident.
Since every writer is different, two or more incidents can be described with different wording and different grammar; even the same person may describe an incident in different words on different occasions.
How can I find and cluster similar files together?

There are various approaches. You can try the scikit-learn example "Clustering text documents using k-means", which represents each document as a TF-IDF vector and clusters the vectors with k-means.
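A minimal sketch of that approach in Python with scikit-learn (the directory layout, file extension, and cluster count are assumptions you would adapt):

    # Sketch: vectorize each incident file with TF-IDF, then cluster with k-means.
    # The directory path and the number of clusters are illustrative assumptions.
    import glob

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    paths = sorted(glob.glob("incidents/*.txt"))        # hypothetical repository layout
    docs = [open(p, encoding="utf-8").read() for p in paths]

    vectorizer = TfidfVectorizer(stop_words="english")  # downweights very common words
    X = vectorizer.fit_transform(docs)                  # sparse TF-IDF matrix, one row per file

    kmeans = KMeans(n_clusters=5, random_state=0)       # 5 is a guess; tune it (e.g. silhouette score)
    labels = kmeans.fit_predict(X)

    for path, label in zip(paths, labels):
        print(label, path)

Files that end up with the same label are the ones whose wording is most similar, regardless of the exact phrasing each author chose.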

Related

I need a service for generating sentences that include a set of words

I need to generate random but well-formed sentences that include my words. A sentence doesn't have to consist entirely of my words, but the more of them it contains, the better.
I have already searched through a lot of machine-learning resources, but everywhere they describe generating RANDOM text; I can't influence the result in any way, for example by requiring that at least a couple of my words appear.
Perhaps someone here knows of repositories with libraries or APIs where I can find something like this?
After a week of searching, I can say that there is no ready-to-use library or service for this.
If you are a data scientist, you may find this problem interesting as the basis for your own startup, I think.

Single YAML file vs. multiple YAML files in different folders and subfolders

I am working on automating the translation workflow and improving the localization process of a Rails website as a whole. I am using SimpleBackend, so only YAML files are used for storing translations.
The current locales directory consists of folders, then sub-folders (in some cases), with those sub-folders containing the yml files. I am considering integrating the project with a third-party tool like Transifex for translation management, so perhaps using a single YAML file for each language would be better for managing the workflow.
If someone can highlight the pros and cons of both structures, it would really help me decide whether I should switch from the nested file structure to the single-file pattern. Also, the project is open source with active contributors, so I am thinking about a long-term solution.
Thanks!
I think whatever tools you are using to make the process flow smoothly factors a lot in this decision. You should explore how exactly Transifex wants things to be structured in output, and try to keep your current input structure, and give that a shot before making a decision.
However, in my opinion, for a large app with a lot of translatable text, my preference would be multiple YAML files for your default locale, and one or two consolidated YAML files for each foreign translation. If there isn't a lot of translatable text in your app, a single file may be fine for you, but given it's already split up, there's a good chance the split is the better choice. With a single file on a team of many contributors, you can end up with a very high-churn file (with frequent merge conflicts) that everyone changes all the time.
Splitting into separate files lets you logically separate out text to match a domain in your app, like a separate yaml file for mailers (or even each mailer), and one for each domain (or controller). Either way, it puts you in control of your organization strategy.
However, there isn't a lot of value, IMO, in separating your foreign translations to mirror that structure. The systems I have experience with (not Transifex) generate your foreign translation files for you, so you just need to sync with the web interface and commit the results.
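As a sketch (the folder and file names here are hypothetical, not something Rails or Transifex prescribes), a split default locale with consolidated foreign-language files could look like this; the only real requirement is that every file ends up on I18n.load_path, which for nested folders usually means adding a glob like config/locales/**/*.yml to config.i18n.load_path:

    config/locales/
      en/
        defaults.en.yml     # shared strings for the default locale
        mailers.en.yml      # one file per mailer domain
        products.en.yml     # one file per controller/domain
      fr.yml                # consolidated French translations (generated/synced by the tool)
      de.yml                # consolidated German translations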

How to extract entities from html using natural language processing or other technique

I am trying to parse entities from web pages that contain a time, a place, and a name. I read a little about natural language processing, and entity extraction, but I am not sure if I am heading down the wrong path, so I am asking here.
I haven't started implementing anything yet, so if certain open source libraries are only suitable for a specific language, that is ok.
A lot of times the data would not be found in sentences, but instead in html structures like lists (e.g. 2013-02-01 - Name of Event - Arena Name).
The structure of the webpages will be vastly different (some might use lists, some might put them in a table, etc.).
What topics can I research to learn more about how to achieve this?
Are there any open source libraries that take into account the structure of html when doing entity extraction?
Would extracting these entities (name, time, place) from HTML work better (or even be possible) with machine vision, where the CSS styling might make it easier to differentiate the important parts of the otherwise unstructured text?
Any guidance on topics or open source projects that I can research would help, I think.
Many programming languages have libraries that generate canonical date-stamps from various formats (e.g. in Java, using SimpleDateFormat). As you say, the structure of the web pages will be vastly different, but a date can only be expressed in a small number of variations, so writing regular expressions for a few (say, half a dozen) formats will let you extract dates from most, if not all, HTML pages.
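A sketch of that idea in Python (the two patterns shown are assumptions covering only the most common formats; datetime.strptime plays the role SimpleDateFormat plays in Java):

    # Sketch: pull date-like strings out of raw page text and normalize them.
    # Only two illustrative formats are handled; extend the list as needed.
    import re
    from datetime import datetime

    PATTERNS = [
        (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),      # 2013-02-01
        (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),  # 02/01/2013
    ]

    def extract_dates(text):
        found = []
        for regex, fmt in PATTERNS:
            for match in regex.findall(text):
                try:
                    found.append(datetime.strptime(match, fmt).date())
                except ValueError:   # matches the regex but is not a real date, e.g. 2013-13-45
                    pass
        return found

    print(extract_dates("2013-02-01 - Name of Event - Arena Name"))  # [datetime.date(2013, 2, 1)]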
Extraction of places and names is harder, however. This is where natural language processing has to come in. What you are looking for is a Named Entity Recognition (NER) system. One of the best open source NER systems is the Stanford NER. Before using it, you should check out their online demo. The demo has three classifiers (for English) that you can choose from. For most of my tasks, I find their english.all.3class.distsim classifier to be quite accurate.
Note that an NER system performs well when the places and names you extract occur in sentences. If they are going to occur in HTML labels, this approach is probably not going to be very helpful.
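For instance, a sketch using the NLTK wrapper for Stanford NER (the jar and model paths are placeholders for wherever you unpacked a local Stanford NER download, and Java must be installed):

    # Sketch: tag PERSON / LOCATION / ORGANIZATION entities with Stanford NER via NLTK.
    from nltk.tag import StanfordNERTagger

    tagger = StanfordNERTagger(
        "stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz",  # model (placeholder path)
        "stanford-ner/stanford-ner.jar",                                    # jar (placeholder path)
    )

    tokens = "The concert is at Madison Square Garden with John Smith".split()
    for token, label in tagger.tag(tokens):
        if label != "O":   # "O" means the token is outside any entity
            print(label, token)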

Opening a PDF file and searching for names there

I have a PDF file. And I want to search for names there.
How can I open the PDF and get all its text with Ruby?
Are there any algorithms to find names?
What should I use as a search engine: Sphinx or something simpler (just SQL LIKE queries)?
To find proper names in unstructured text, the technical name for the problem you are trying to solve is Named Entity Recognition (or Named Entity Extraction). A number of natural language toolkits and research papers implement various algorithms to try to solve it. None of them will achieve perfect accuracy, but they may be good enough for your needs. I haven't tried it myself, but the web page for the Stanford Named Entity Recognizer has a link to Ruby bindings.
Tough question. These topics are still in the research area of the semantic web. I can only suggest some directions, but I would be curious to know your final choice.
I'd use pdf-reader: https://github.com/yob/pdf-reader
You could use a Bloom filter backed by a dictionary and assume that words not found in the dictionary are names. Not always realistic, but it's a first approach.
To catch more names, you could also check for words beginning with a capital letter (not great, but it's another basic approach). A potential resource: http://snippets.dzone.com/posts/show/4235
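To make those two heuristics concrete, here is a sketch (in Python rather than Ruby, purely for illustration; pypdf stands in for the pdf-reader gem mentioned above, a plain set stands in for a real Bloom filter, and the file paths are assumptions):

    # Sketch: extract text from a PDF, then flag capitalized words that are not in a
    # dictionary as candidate names.
    import re
    from pypdf import PdfReader

    reader = PdfReader("document.pdf")   # placeholder filename
    text = "\n".join(page.extract_text() or "" for page in reader.pages)

    with open("/usr/share/dict/words", encoding="utf-8") as f:   # assumed dictionary location
        dictionary = {line.strip().lower() for line in f}

    candidates = set()
    for word in re.findall(r"[A-Za-z]+", text):
        if word[0].isupper() and word.lower() not in dictionary:
            candidates.add(word)         # capitalized and not in the dictionary -> maybe a name

    print(sorted(candidates))

Expect plenty of false positives (sentence-initial words, acronyms, brand names), which is why an NER system as suggested above is the more robust route.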
For your search engine, the two main choices with Rails are Sphinx and Solr.
Hope this helps!

Tool to parse text for possible Wikipedia links

Does a tool exist that can parse text and output that text, hyper-linked to Wikipedia entries for words of interest?
For example, I'd like a tool that could turn something like:
The most popular search algorithm on a
sorted list is the binary search.
Into:
The most popular search algorithm on a
sorted list is the binary search.
It would be wonderful if Wikipedia had an API that would do this, since they would be best equipped to determine what the "words of interest" are.
In my example I simply linked every phrase that links directly to an entry, except for "The" and "most".
There is a tool that does exactly what you're asking for.
http://wikify.appointment.at/
It's not perfect, but it works.
You have two separate problems to solve here:
1. Deciding which words should be linked
2. Determining whether there's a suitable entry to link those words to
Now, (2) is simpler, though it's also somewhat problematic. Wikipedia has an API that allows you to gather data efficiently, and they also allow "screen scraping". But there's a problem with disambiguation: sometimes you might not hit the entry you wanted. For example, python links to a disambiguation page, as it can be a programming language, a snake, and a couple of other things.
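For (2), a sketch of checking a candidate phrase against the MediaWiki API (the requests library and the pageprops-based disambiguation check are assumptions about one reasonable way to wire this up):

    # Sketch: ask the MediaWiki API whether a phrase has an article, and whether that
    # article is a disambiguation page (as "python" is).
    import requests

    def lookup(title):
        resp = requests.get(
            "https://en.wikipedia.org/w/api.php",
            params={
                "action": "query",
                "titles": title,
                "prop": "pageprops",
                "redirects": 1,   # follow redirects such as "Binary search algorithm"
                "format": "json",
            },
        )
        page = next(iter(resp.json()["query"]["pages"].values()))
        if "missing" in page:
            return "no article"
        if "disambiguation" in page.get("pageprops", {}):
            return "disambiguation page"
        return "article: " + page["title"]

    for phrase in ("binary search", "python", "asdkjhqwe"):
        print(phrase, "->", lookup(phrase))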
(1) is much harder, though. You can take the "simple approach" and attempt to find links for all non-trivial nouns (or even noun/adjective pairs). Non-trivial here means omitting words like "fiend", "word", "computer", etc.
But this would result in a plethora of links, which isn't convenient to read. It's really up to you to decide what's interesting in the text, and that depends a lot on the text itself. In an article for professional programmers, do you really want to link "search algorithm" every time? For beginners, perhaps you do.
To conclude, I strongly doubt there's a single general-purpose tool that will do the trick for you. But you surely have all the options at hand, and something need-specific can be coded without too much effort.
Silviu Cucerzan of Microsoft Research tackled this problem. Well, not the problem of inserting the links, but the general issue of determining which entities are mentioned in some piece of text. Fortunately for you, he used Wikipedia articles as his set of entities. His paper, "Large-Scale Named Entity Disambiguation Based on Wikipedia Data", is available on his website. Direct link: pdf.
