Binary And Alternate Representation Transforming - virus

In this publication about metamorphic viruses I found the following classification:
Metamorphic malware may be either a binary-transformer or an alternate-representation-transformer. The former class transforms the binary image that is executed, whereas the latter class carries its code in a higher level representation, which is used for transformation.
I have not found a precise definition of these two terms. I would like to know whether there is a generic definition for each one, or a generic context in which to introduce this classification in my dissertation.
Thanks all.

A more common term for Binary-Transformer is Binary Code Obfuscation, or simply Binary Obfuscation (it plays an essential role in evading static malware analysis and detection). Some authors also use the term Post-compilation Obfuscation[*]. The term Binary Obfuscation is also used in reverse engineering for innocent purposes (e.g. to recover source files).[1][2][3]
For Alternate-Representation-Transformer, you can use the terms Assembly Level Obfuscation, Source Code Obfuscation (or Source Obfuscation), Mnemonic Level Obfuscation, or Code Obfuscation.
Read this short article to find useful common terms.
(But I am not sure about Post-compilation Obfuscation.)
Paper writing is not an exact science. Different authors use different (sometimes rare) words to reduce the probability of a match. Many times my papers have been rejected by journals only because of presentation, not because of a technical flaw.

Related

Question pairs (ground truth) datasets for Word2Vec model testing?

I'm looking for test datasets to optimize my Word2Vec model. I have found a good one from gensim:
gensim/test/test_data/questions-words.txt
Does anyone know other similar datasets?
Thank you!
It is important to note that there isn't really a "ground truth" for word-vectors. There are interesting tasks you can do with them, and some arrangements of word-vectors will be better on a specific task than others.
But also, the word-vectors that are best on one task – such as analogy-solving in the style of the questions-words.txt problems – might not be best on another important task – like say modeling texts for classification or info-retrieval.
That said, you can make your own test data in the same format as questions-words.txt. Google's original word2vec.c release, which included a tool for statistically combining nearby words into multi-word phrases, also included a questions-phrases.txt file in the same format, which can be used to test word-vectors that have been similarly constructed for 'words' that are actually short multi-word phrases.
The Python gensim word-vectors support includes an extra method, evaluate_word_pairs() for checking word-vectors not on analogy-solving but on conformance to collections of human-determined word-similarity-rankings. The documentation for that method includes a link to an appropriate test-set for that method, SimLex-999, and you may be able to find other test sets of the same format elsewhere.
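For concreteness, here is a minimal sketch in Python of running both kinds of evaluation with gensim (assuming gensim 4.x; 'my_vectors.bin' is a placeholder for your own vectors, and wordsim353.tsv is a word-pair file in the same format as SimLex-999 that I believe ships with gensim's test data - substitute SimLex-999 or your own file as needed):

    from gensim.models import KeyedVectors
    from gensim.test.utils import datapath

    # Load pre-trained vectors (placeholder file name - substitute your own).
    wv = KeyedVectors.load_word2vec_format("my_vectors.bin", binary=True)

    # Analogy-style evaluation, questions-words.txt format.
    analogy_score, sections = wv.evaluate_word_analogies(datapath("questions-words.txt"))
    print("analogy accuracy:", analogy_score)

    # Similarity-ranking evaluation (SimLex-999 / WordSim-353 style pair files).
    pearson, spearman, oov_ratio = wv.evaluate_word_pairs(datapath("wordsim353.tsv"))
    print("Spearman correlation:", spearman)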
But, again, none of these should be considered the absolute test of word-vectors' overall quality. The best test, for your particular project's use of word-vectors, would be some repeatable domain-specific evaluation score you devise yourself, that's inherently correlated to your end goals.

Scikit - How to get single term for similar words using sklearn

I'm new to text analysis and scikit-learn. I am trying to vectorize tweets using sklearn's TfidfVectorizer class. When I listed the terms using 'get_feature_names()' after vectorizing the tweets, I see similar words such as 'goal', 'gooooal' or 'goaaaaaal' as different terms.
The question is: how can I map such similar but different words to a single term 'goal', using sklearn feature extraction techniques (or any other technique), to improve my results?
In short - you can't. This is a very complex problem that goes all the way to full language understanding. Think for a moment - can you define exactly what it means to be "similar but different"? If you can't, a computer will not be able to either. What can you do?
You can come up with easy preprocessing rules, such as "collapse any letter repeated many times", which will fix the "goal" problem (this should not lead to any further problems); see the sketch after this list.
You can use existing databases of synonyms (like WordNet) to "merge" words with the same meaning into the same tokens (this might cause false positives - you might "merge" words of different meaning due to the lack of context analysis).
You can build some language model and use it to embed your data in a lower-dimensional space, forcing your model to merge similar meanings (using the well-known heuristic that "words that occur in similar contexts have similar meaning"). One such technique is Latent Semantic Analysis, but obviously many more are possible.
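As an illustration of the first option, here is a minimal sketch (one possible rule, not the only fix): collapse any character repeated three or more times before vectorizing, so 'gooooal' and 'goaaaaaal' end up as the same token as 'goal'. (get_feature_names_out assumes sklearn 1.0+; on older versions use get_feature_names.)

    import re
    from sklearn.feature_extraction.text import TfidfVectorizer

    def squeeze_repeats(text):
        # Reduce any character repeated 3+ times to a single occurrence.
        return re.sub(r"(.)\1{2,}", r"\1", text.lower())

    tweets = ["GOAL!!!", "gooooal", "goaaaaaal", "what a goal"]
    vectorizer = TfidfVectorizer(preprocessor=squeeze_repeats)
    X = vectorizer.fit_transform(tweets)
    print(vectorizer.get_feature_names_out())   # all four tweets now share the term 'goal'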

Cutting down on Stanford parser's time-to-parse by pruning the sentence

We are already aware that the parsing time of the Stanford Parser increases with sentence length. I am interested in finding creative ways to prune a sentence so that parsing time decreases without compromising accuracy. For example, we could replace known noun phrases with single-word nouns. Similarly, are there other smart ways of guessing a subtree beforehand, say, using POS tag information? We have a huge corpus of unstructured text at our disposal, so we wish to learn some common patterns that can ultimately reduce parsing time. References to publicly available literature in this regard would also be highly appreciated.
P.S. We already are aware of how to multi-thread using Stanford Parser, so we are not looking for answers from that point of view.
You asked for 'creative' approaches - the Cell Closure pruning method might be worth a look. See the series of publications by Brian Roark, Kristy Hollingshead, and Nathan Bodenstab. Papers: 1 2 3. The basic intuition is:
Each cell in the CYK parse chart 'covers' a certain span (e.g. the first 4 words of the sentence, or words 13-18, etc.)
Some words - particularly in certain contexts - are very unlikely to begin a multi-word syntactic constituent; others are similarly unlikely to end a constituent. For example, the word 'the' almost always precedes a noun phrase, and it's almost inconceivable that it would end a constituent.
If we can train a machine-learned classifier to identify such words with very high precision, we can thereby identify cells which would only participate in parses placing said words in highly improbable syntactic positions. (Note that this classifier might make use of a linear-time POS tagger, or other high-speed preprocessing steps.)
By 'closing' these cells, we can reduce both the asymptotic and average-case complexities considerably - in theory, from cubic complexity all the way to linear; practically, we can achieve approximately n^1.5 without loss of accuracy.
In many cases, this pruning actually increases accuracy slightly vs. an exhaustive search, because the classifier can incorporate information that isn't available to the PCFG. Note that this is a simple, but very effective form of coarse-to-fine pruning, with a single coarse stage (as compared to the 7-stage CTF approach in the Berkeley Parser).
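To make the intuition concrete, here is a toy sketch (my own illustration, heavily simplified - not the actual Roark/Hollingshead/Bodenstab implementation, which is probabilistic and far more refined) of CKY parsing with a CNF grammar, where a hypothetical classifier's begin/end verdicts close chart cells:

    from collections import defaultdict

    def cky_with_cell_closure(words, unary_rules, binary_rules, may_begin, may_end):
        """unary_rules:  dict word -> set of nonterminals (lexical rules)
           binary_rules: dict (B, C) -> set of parents A for rules A -> B C
           may_begin[i] / may_end[j]: classifier verdicts on whether word i may
           begin, and word j may end, a multi-word constituent."""
        n = len(words)
        chart = defaultdict(set)                  # (i, j) -> nonterminals over words[i:j]
        for i, w in enumerate(words):             # single-word cells are never closed
            chart[(i, i + 1)] = set(unary_rules.get(w, ()))
        for span in range(2, n + 1):
            for i in range(n - span + 1):
                j = i + span
                if not (may_begin[i] and may_end[j - 1]):
                    continue                      # cell closed: skip all work for this span
                for k in range(i + 1, j):
                    for B in chart[(i, k)]:
                        for C in chart[(k, j)]:
                            chart[(i, j)] |= binary_rules.get((B, C), set())
        return chart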
To my knowledge, the Stanford Parser doesn't currently implement this pruning technique; I suspect you'd find it quite effective.
Shameless plug
The BUBS Parser implements this approach, as well as a few other optimizations, and thus achieves throughput of around 2500-5000 words per second, usually with accuracy at least equal to what I've measured with the Stanford Parser. Obviously, if you're using the rest of the Stanford pipeline, the built-in parser is already well integrated and convenient. But if you need improved speed, BUBS might be worth a look, and it does include some example code to aid in embedding the engine in a larger system.
Memoizing Common Substrings
Regarding your thoughts on pre-analyzing known noun phrases or other frequently-observed sequences with consistent structure: I did some evaluation of a similar idea a few years ago (in the context of sharing common substructures across a large corpus, when parsing on a massively parallel architecture). The preliminary results weren't encouraging. In the corpora we looked at, there just weren't enough repeated substrings of substantial length to make it worthwhile. And the aforementioned cell closure methods usually make those substrings really cheap to parse anyway.
However, if your target domains involved a lot of repetition, you might come to a different conclusion (maybe it would be effective on legal documents with lots of copy-and-paste boilerplate? Or news stories that are repeated from various sources or re-published with edits?)

What are some good resources to learn about automatic, learning based document summarization?

Document summarization can be done by text extraction from the source document or you can employ learning algorithms to decipher what's conveyed by the document, and then generate the summary using language generation techniques (much like a human does).
Are there algorithms or existing research work for the latter method? In general, what are some good resources to learn about document summarization techniques?
The topic you are looking for is called Automatic Summarization in the computer-science community.
Automatic summarization is the process of reducing a text document with a computer program in order to create a summary that retains the most important points of the original document.
Methods of automatic summarization include extraction-based, abstraction-based, maximum entropy-based, and aided summarization.
Here is a good survey paper on this topic. You might want to take a look at two other papers: 1 and 2 as well.
Hope it helps.
Automatic text summarisation is generally of two types: abstractive and extractive. The abstractive approach is a bit more complicated than the extractive one. In the former, important features and key information are extracted from sentences; then, using natural language generation techniques, new sentences are generated from those features.
In the latter approach, all the sentences are ranked using methods like lexical ranking, lexical chaining, etc. Similar sentences are clustered using approaches like cosine similarity, fuzzy matching, etc. The most important sentences of the clusters are used to generate a summary of the given document.
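As a rough illustration of the extractive idea (my own sketch, not any of the systems listed below): rank sentences by how central they are to the rest of the document, using TF-IDF cosine similarity, and keep the top few in their original order.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def extractive_summary(sentences, n_keep=3):
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        sim = cosine_similarity(tfidf)               # sentence-to-sentence similarity
        scores = sim.sum(axis=1)                     # centrality: similarity to all others
        top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:n_keep]
        return [sentences[i] for i in sorted(top)]   # keep original document order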
Some existing Automatic document text summarisation work and techniques compiled from various sources:
Semantria method of lexical chaining
MEAD
http://dl.acm.org/citation.cfm?id=81789
https://www.cs.cmu.edu/~afm/Home_files/Das_Martins_survey_summarization.pdf
http://www.upf.edu/pdi/iula/iria.dacunha/docums/cicling2013LNCS.pdf

The options for the first step of document clustering

I checked several document clustering algorithms, such as LSA, pLSA, LDA, etc. It seems they all require representing the documents to be clustered as a document-word matrix, where the rows stand for documents and the columns stand for the words appearing in them. The matrix is often very sparse.
I am wondering: are there any other options for representing documents besides the document-word matrix? I ask because I believe the way we express a problem has a significant influence on how well we can solve it.
As #ffriend pointed out, you cannot really avoid the term-document-matrix (TDM) paradigm. Clustering methods operate on points in a vector space, and this is exactly what the TDM encodes. However, within that conceptual framework there are many things you can do to improve the quality of the TDM:
feature selection and re-weighting attempt to remove or weight down features (words) that do not contribute useful information (in the sense that your chosen algorithm does just as well or better without these features, or with them down-weighted). You might want to read more about Mutual Information (and its many variants) and TF-IDF.
dimensionality reduction is about encoding the information as accurately as possible in the TDM using fewer columns. Singular Value Decomposition (the basis of LSA) and Non-Negative Tensor Factorisation are popular in the NLP community. A desirable side effect is that the TDM becomes considerably less sparse; see the sketch after this list.
feature engineering attempts to build a TDM where the choice of columns is motivated by linguistic knowledge. For instance, you may want to use bigrams instead of words, or only use nouns (requires a part-of-speech tagger), or only use nouns with their associated adjectival modifier (e.g. big cat, requires a dependency parser). This is a very empirical line of work and involves a lot of experimentation, but often yields improved results.
the distributional hypothesis makes it possible to get a vector representing the meaning of each word in a document. There has been work on trying to build up a representation of an entire document from the representations of the words it contains (composition). Here is a shameless link to my own post describing the idea.
There is a massive body of work on formal and logical semantics that I am not intimately familiar with. A document can be encoded as a set of predicates instead of a set of words, i.e. the columns of the TDM can be predicates. In that framework you can do inference and composition, but lexical semantics (the meaning of individual words) is hard to deal with.
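A small sketch of the dimensionality-reduction point above (an illustration only; the corpus, the bigram choice, and the number of components are placeholders): build a TF-IDF weighted TDM, then compress it with truncated SVD, which is essentially LSA.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = ["the big cat sleeps", "a cat chased a mouse", "markets rallied on strong earnings"]
    tdm = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(docs)   # sparse TDM, unigrams + bigrams
    dense = TruncatedSVD(n_components=2).fit_transform(tdm)         # far fewer, denser columns
    print(tdm.shape, "->", dense.shape)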
For a really detailed overview, I recommend Turney and Pantel's "From Frequency to Meaning : Vector Space Models of Semantics".
Your question says you want document clustering, not term clustering or dimensionality reduction. Therefore I'd suggest you steer clear of the LSA family of methods, since they're a preprocessing step.
Define a feature-based representation of your documents (which can be, or include, term counts but needn't be), and then apply a standard clustering method. I'd suggest starting with k-means as it's extremely easy and there are many, many implementations of it.
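A minimal sketch of that suggestion (scikit-learn is my choice of library here; any k-means implementation would do, and the documents and k are placeholders):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    docs = ["the cat sat on the mat", "a cat chased the dog",
            "stock markets fell today", "markets and shares dropped sharply"]
    X = TfidfVectorizer(stop_words="english").fit_transform(docs)
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)    # cluster id per document, e.g. [0 0 1 1]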
OK, this is quite a general question, and many answers are possible; none is definitive, because it's an ongoing research area. So far, the answers I have read mainly concern so-called "vector-space models", and your question is phrased in a way that suggests such "statistical" approaches. Yet, if you want to avoid manipulating explicit term-document matrices, you might want to have a closer look at the Bayesian paradigm, which relies on the same distributional hypothesis but exploits a different theoretical framework: you no longer manipulate raw distances, but rather probability distributions and, most importantly, you can do inference based on them.
You mentioned LDA, by which I guess you mean Latent Dirichlet Allocation, the best-known such Bayesian model for document clustering. It is an alternative paradigm to vector-space models, and a winning one: it has been shown to give very good results, which justifies its current success. Of course, one can argue that you still use a kind of term-document matrix through the multinomial parameters, but it's clearly not the most important aspect, and Bayesian researchers rarely (if ever) use this term.
Because of its success, there are many software packages on the net that implement LDA. Here is one, but there are many others:
http://jgibblda.sourceforge.net/
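For completeness, here is a minimal LDA sketch in Python with gensim (one of the "many others"; the toy corpus and parameters are illustrative only):

    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    docs = [["cat", "pet", "mouse"], ["dog", "pet", "cat"],
            ["stock", "market", "shares"], ["market", "shares", "earnings"]]
    dictionary = Dictionary(docs)
    corpus = [dictionary.doc2bow(doc) for doc in docs]       # bag-of-words per document
    lda = LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
    for bow in corpus:
        print(lda.get_document_topics(bow))                  # topic mixture ~ a soft cluster assignment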
