I am working on a project where I need to extract "technology related keywords/keyphrases" from text. For example, my text is:
"ABC Inc has been working on a project related to machine learning which makes use of the existing libraries for finding information from big data."
The extracted keywords/keyphrase should be: {machine learning, big data}.
My text documents are stored as BSON documents in MongoDB.
What are the best NLP libraries (with sufficient documentation and examples) for this task, and how would I use them?
Thanks!
It looks like you need to narrow down more than just keywords/key phrases and find the subject and object of each sentence.
For subject/object recognition, I recommend the Stanford Parser or the Google Language API, where you send a string and get a dependency tree response.
You can test the Google API first to see if it works well with your corpus: https://cloud.google.com/natural-language/
The outcome here is a subject-predicate-object (SPO) triplet, where the predicate describes the relationship. You'll need to traverse the dependency graph and write a script to parse out the triplet.
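If you end up writing that script yourself, here is a rough sketch of the idea using spaCy's dependency parse (my choice for illustration, not something the Stanford or Google tools require; the dependency labels and the verb-centred heuristic are assumptions you would tune for your corpus):

```python
# A minimal sketch (not the Google API itself): pulling rough SPO triplets
# out of a dependency parse with spaCy.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def spo_triplets(text):
    """Yield rough (subject, predicate, object) triplets per sentence."""
    doc = nlp(text)
    for sent in doc.sents:
        for token in sent:
            if token.pos_ == "VERB":
                subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
                objects = [c for c in token.children if c.dep_ in ("dobj", "attr", "dative")]
                for s in subjects:
                    for o in objects:
                        yield (s.text, token.lemma_, o.text)

text = ("ABC Inc has been working on a project related to machine learning "
        "which makes use of the existing libraries for finding information from big data.")
print(list(spo_triplets(text)))
```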
Other Packages:
I use NLTK, spaCy, and TextBlob frequently. If the corpus is simple, generic, and straightforward, spaCy and TextBlob work well out of the box. If the corpus is highly customized, domain-specific, or messy (incorrect spelling or grammar), I'll use NLTK and spend more time customizing my NLP text-processing pipeline with scrubbing, lemmatizing, etc. If you decide to go with one of these packages, you may want to add your own custom dictionary of technology-related keywords and keyphrases so that your parser can catch them; see the sketch after the links below.
NLTK Tutorial: http://www.nltk.org/book/
Spacy Quickstart: https://spacy.io/usage/
Textblob Quickstart: http://textblob.readthedocs.io/en/dev/quickstart.html
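If you go the custom-dictionary route, here is a minimal sketch using spaCy's PhraseMatcher (the term list is just an illustrative sample, not a real dictionary):

```python
# A minimal sketch of the custom-dictionary idea with spaCy's PhraseMatcher.
# The term list is only a placeholder; you would maintain your own.
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
tech_terms = ["machine learning", "big data", "deep learning", "natural language processing"]

matcher = PhraseMatcher(nlp.vocab, attr="LOWER")  # case-insensitive matching
matcher.add("TECH", [nlp.make_doc(term) for term in tech_terms])

text = ("ABC Inc has been working on a project related to machine learning "
        "which makes use of the existing libraries for finding information from big data.")
doc = nlp(text)
keywords = {doc[start:end].text for _, start, end in matcher(doc)}
print(keywords)  # e.g. {'machine learning', 'big data'}
```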
Are there any tools that could help recognize the distribution pattern of the data and then help decide which ML algorithm to choose?
Firstly, you have to understand machine learning as a field and have some understanding of its subfields. If you don't intuitively understand your tools, you won't be able to identify when to use them.
The idea you're talking about is called exploratory data analysis, and it can be very approachable if you think about it the right way. Think about it in terms of the scientific method:
First, look over the data, and any documentation about it.
Then, come to some hypotheses about the patterns that might exist.
Based on your understanding of ML, brainstorm some approaches that might give some insight into your hypotheses. For example, if you see that your proposed dependent variable takes only a few distinct values, you have a classification problem, and based on your input data, you should choose an appropriate approach.
The tools you might find useful are plentiful, but a good start could be the programming language R or Python. Both are very strong data-science tools. R has a steeper learning curve but is built with data science in mind. Python, on the other hand, is very easy to pick up, but you have more choices to make with regard to ML and data-science libraries. With Python, look into pandas for CSV and data manipulation, and TensorFlow, Theano, or scikit-learn for data analysis and ML.
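As a hedged illustration of that first "look over the data" step in Python (the file name and the "target" column are assumptions made for the sake of the example):

```python
# A minimal exploratory-data-analysis sketch with pandas and scikit-learn.
# "data.csv" and the "target" column are assumptions for illustration.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("data.csv")

print(df.describe())                 # summary statistics for numeric columns
print(df["target"].value_counts())   # a handful of distinct values suggests classification

# Quick baseline: if the target is categorical, fit a simple classifier first.
X = df.drop(columns=["target"]).select_dtypes("number")
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("baseline accuracy:", model.score(X_test, y_test))
```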
Hope this helps!
Looking for APIs, methods, research, etc on the subject of deciding whether a tweet (a string, really) conveys a mood of danger.
For example:
Danger: "this house across the street is on fire!!"
Not danger: "this girl is on fire! love this song"
There is little research on the particular problem of detecting danger, but there are a few research papers describing methods to detect natural hazards. Your example is reminiscent of the title of one of them: Finding Fires with Twitter. Another paper you may find useful is Emergency Situation Awareness: Twitter Case Studies.
In general, however, the best approach to such a problem is supervised classification, very similar to how sentiment analysis is done (or rather, was done, since more sophisticated machine-learning paradigms such as deep learning are being applied nowadays).
The essence is to label documents (in your case, tweets) as "danger" or "not danger". This labeling is done by human experts. Ideally, they should be well versed in the language and the domain, so native English speakers who know the colloquialisms of Twitter would make perfect annotators for this task.
Once an adequate number of documents has been labeled, the baseline (i.e. the basic approach) is usually established by creating n-gram word vectors as feature vectors and running an SVM. If you are not familiar with the machine-learning details, please read up on them before doing this.
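For what it's worth, scikit-learn is one convenient way to build that baseline; here is a minimal sketch (the two example tweets stand in for your annotated corpus):

```python
# A minimal sketch of the n-gram + SVM baseline with scikit-learn.
# The tiny labeled set below is only a placeholder for your annotated tweets.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline

tweets = [
    "this house across the street is on fire!!",
    "this girl is on fire! love this song",
]
labels = ["danger", "not_danger"]

# Word unigrams and bigrams as features, linear SVM as the classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(tweets, labels)

print(model.predict(["smoke everywhere, the building is on fire"]))
```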
Using data mining, we are able to find useful patterns in a large set of data using techniques like correlation analysis, and there must be some open-source tools for this (what are some examples?).
Is this pull-based or push-based? That is, do we provide a data set as well as specific queries as input to the data-mining engine, which then gives us answers (as in SQL)? Or do we only supply a large data set, and the engine finds patterns on its own (patterns we never knew existed and/or couldn't have formulated queries for), so that we don't really pull specific answers from it; rather, it pushes the patterns to us?
A quick read of the Wikipedia article doesn't clear up my doubts.
For open source, have a look at Weka.
In regard to the push/pull question: it's a bit of both, but it's not quite that simple. You must be looking for something. For example, if you are looking for clusters, there are unsupervised algorithms that will give you an answer with minimal guidance.
In practice, things are more meaningful if you know about the data you analyse and are looking for regularities and patterns that make sense.
Playing with Weka will give you a better idea of the range of possibilities.
Python and R are other great open source tools that have great popularity in the data mining area.
A great tool that I used recently is scikit-learn.
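To make the clustering point above concrete, here is a minimal scikit-learn sketch; the iris dataset is just a convenient built-in stand-in for your own data:

```python
# A minimal scikit-learn sketch of the "looking for clusters" case: you tell
# the algorithm what kind of structure to look for (here, 3 clusters), not
# what the patterns are.
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

X = load_iris().data
X_scaled = StandardScaler().fit_transform(X)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
print(kmeans.labels_[:10])       # cluster assignment per row
print(kmeans.cluster_centers_)   # the discovered group centres
```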
I am currently designing a low level network serialization protocol (in fact, a refinement of an existing protocol).
As the work progresses, pen-and-paper documents are starting to show their limits: I have tons of papers, new and outdated mixed together, etc. And I can't show anything to anyone, since I describe the protocol using my own notation (a mix of flow charts and C structures).
I need software that would help me design a network protocol. I should be able to define structures, fields, their sizes, their layout, etc., and the software would generate some nice UML-ish diagrams.
Sorry to say, everything I've seen so far (various serial protocols for embedded devices/networks) has used Word documents, with plain old tables showing allocations of fields to the bytes in the message. Alternatively, I've seen it done in Excel documents! It works, and people can read it.
Unfortunately, that's not helpful for automatic code generation, unless you use a very strict format in, e.g., an Excel doc that you can then parse with a tool to generate some code. It would be good to have a notation that can be easily machine-parsed as well as human-readable.
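As a small illustration of what a machine-parseable notation could look like (not a recommendation of any particular tool; the field names and formats are made up), you can keep the layout as plain data and drive both documentation and packing code from it, e.g. in Python:

```python
# A minimal sketch: the message layout is plain data (field name + struct
# format code), so the same table can generate a human-readable description
# and do the byte-level packing. Field names and formats are illustrative only.
import struct

HEADER_FIELDS = [
    ("version",     "B"),   # unsigned 8-bit
    ("msg_type",    "B"),   # unsigned 8-bit
    ("payload_len", "H"),   # unsigned 16-bit
    ("sequence",    "I"),   # unsigned 32-bit
]

FMT = ">" + "".join(code for _, code in HEADER_FIELDS)  # big-endian, no padding

def pack_header(**values):
    return struct.pack(FMT, *(values[name] for name, _ in HEADER_FIELDS))

def describe():
    """Emit a simple human-readable table from the same definition."""
    for name, code in HEADER_FIELDS:
        print(f"{name:<12} {code:<3} {struct.calcsize(code)} byte(s)")

describe()
print(pack_header(version=1, msg_type=2, payload_len=16, sequence=42).hex())
```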
For showing message handshaking and sequences, a UML sequence diagram is good of course. There are lots of tools readily available to help you with that part of it.
There are lots of parsers and lexers for scripts (i.e. structured computer languages), but I'm looking for one that can break an (almost) unstructured text document into larger sections, e.g. chapters, paragraphs, etc.
It's relatively easy for a person to identify them: where the table of contents is, where the acknowledgements are, or where the main body starts, and it is possible to build rule-based systems to identify some of these (such as paragraphs).
I don't expect it to be perfect, but does anyone know of such a broad 'block-based' lexer/parser? Or could you point me towards literature that might help?
Many lightweight markup languages like Markdown (which, incidentally, SO uses), reStructuredText, and (arguably) POD are similar to what you're talking about. They have minimal syntax and break input down into parseable syntactic pieces. You might be able to get some information by reading about their implementations.
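To give a feel for how far simple rules get you, here is a minimal block-splitting sketch; the heading heuristics are my own assumptions, not anything standard:

```python
# A minimal rule-based sketch of block-level "lexing": split on blank lines
# and use a couple of crude heuristics to tag blocks. The heuristics
# (ALL-CAPS or "Chapter N" lines are headings) are assumptions.
import re

def split_blocks(text):
    blocks = []
    for raw in re.split(r"\n\s*\n", text.strip()):
        block = raw.strip()
        first_line = block.splitlines()[0]
        if re.match(r"^(chapter|section)\s+\d+", first_line, re.IGNORECASE) or \
           (first_line.isupper() and len(first_line.split()) <= 8):
            kind = "heading"
        else:
            kind = "paragraph"
        blocks.append((kind, block))
    return blocks

sample = "CHAPTER 1\n\nIt was a dark and stormy night.\n\nThe rain fell in torrents."
for kind, block in split_blocks(sample):
    print(kind, "->", block[:40])
```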
Define the annotation standard, which indicates how you would like to break things up.
Go on to Amazon Mechanical Turk and ask people to label 10K documents using your annotation standard.
Train a CRF (which is like an HMM, but better) on this training data.
If you actually want to go this route, I can elaborate on the details. But this will be a lot of work.
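If it helps, here is a rough sketch of step 3 using the sklearn-crfsuite package (one possible CRF implementation; nothing above prescribes a specific toolkit, and the features, labels, and tiny training set are placeholders):

```python
# A hedged sketch of training a CRF over labeled line sequences with
# sklearn-crfsuite. Each document is a sequence of lines; each line gets a
# feature dict and a label from your annotation standard.
import sklearn_crfsuite

def line_features(line):
    return {
        "is_upper": line.isupper(),
        "is_blank": not line.strip(),
        "num_words": len(line.split()),
        "ends_with_period": line.rstrip().endswith("."),
    }

# Placeholder training data: lists of labeled line sequences.
docs = [["CHAPTER 1", "", "It was a dark and stormy night."]]
labels = [["heading", "blank", "body"]]

X = [[line_features(line) for line in doc] for doc in docs]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, labels)
print(crf.predict(X))
```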
Most lex/yacc-style programs work with a well-defined grammar. If you can define your grammar in a BNF-like format (most of these parsers accept similar syntax), then you can use any of them. That may be stating the obvious. However, you can still be a little fuzzy about the 'blocks' (tokens) of text that would be part of your grammar; after all, you define the rules for your tokens.
I have used the Parse::RecDescent Perl module in the past, with varying levels of success, for similar projects.
Sorry, this may not be a good answer; it's more a case of sharing my experiences from similar projects.
Try Pygments, GeSHi, or prettify.
They can handle just about anything you throw at them and are very forgiving of errors in your grammar as well as your documents.
References:
Gitorious uses prettify,
GitHub uses Pygments,
Rosetta Code uses GeSHi.