convert sequence file to vector - mahout

I am trying to implement the naïve Bayes algorithm in Mahout to do sentiment analysis on tweet and Facebook data. I have the tweets and Facebook data in a text file. I am converting those files into a sequence file using the command
bin/mahout seqdirectory -i /user/hadoopUser/sample/input -o /user/hadoopUser/sample/seqoutput
and then I tried converting the sequence file into vectors, in order to provide input to Mahout, using the command
bin/mahout seq2sparse -i /user/hadoopUser/sample/seqoutput -o /user/hadoopUser/vectoroutput -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 200 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq
This converts the whole document into vectors, but I want to convert each sentence into vectors, not the document as a whole, because I don't want to classify documents; I want to classify the comments within the documents. Could anyone help me solve this problem?

What you have is a CSV file with the tweet data, right? I'm dealing with this exact same problem. What I did (I'm not sure it worked, as I don't even know how to interpret the clustering output; it's just a mess of numbers and words) was write each column of my CSV file into a sequence file using Mahout's SequenceWriter class, and then run seq2sparse as usual on that sequence file.
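For reference, here is a minimal sketch of that writing step using Hadoop's SequenceFile.Writer (the same format seqdirectory produces). The file names, paths and the "label,text" CSV layout are assumptions for illustration, not part of the original answer:

// Hedged sketch: write one CSV row per record into a Hadoop SequenceFile so that
// seq2sparse treats each comment as its own document. Paths and the CSV layout are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class CsvToSeq {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path output = new Path("tweets-seq");                              // hypothetical output path
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, output, Text.class, Text.class);
    BufferedReader reader = new BufferedReader(new FileReader("tweets.csv")); // hypothetical input
    String line;
    int id = 0;
    while ((line = reader.readLine()) != null) {
      String[] fields = line.split(",", 2);                            // assumes "label,text" rows
      if (fields.length < 2) continue;
      // each comment gets its own key, so downstream jobs see it as a separate document
      writer.append(new Text("/tweet-" + id++), new Text(fields[1]));
    }
    reader.close();
    writer.close();
  }
}

The point is simply that every comment is written under its own key; the value text is what seq2sparse will tokenize and vectorize.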

I am not 100% sure, but the main problem is that Mahout sees this file as a single key/value pair.
You need to add an additional ID, for example an MD5 hash, for each line.
So the CSV format will be:
positive bf9373d6d85959ec755eb8ac5ba0ae77 This movie is a real masterpiece
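A hedged sketch of generating such a key per line with java.security.MessageDigest; the comments.txt file name is a placeholder, and the "positive" label is hard-coded purely to mirror the example above:

// Sketch (assumption, not from the original answer): prepend an MD5 hash of each comment
// so every line gets a unique id before it becomes a key/value pair.
import java.io.BufferedReader;
import java.io.FileReader;
import java.security.MessageDigest;

public class AddMd5Keys {
  public static void main(String[] args) throws Exception {
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    BufferedReader reader = new BufferedReader(new FileReader("comments.txt")); // hypothetical file
    String line;
    while ((line = reader.readLine()) != null) {
      byte[] digest = md5.digest(line.getBytes("UTF-8"));
      StringBuilder hex = new StringBuilder();
      for (byte b : digest) {
        hex.append(String.format("%02x", b));                // hex-encode the 16-byte hash
      }
      // label is hard-coded for illustration; in practice it comes from your data
      System.out.println("positive " + hex + " " + line);
    }
    reader.close();
  }
}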

Related

Text Classification/Document Classification with Sequence Tagging with Mallet

I have documents arranged in folders, where each folder represents a class (category). For a new input (such as a question asked), I have to identify its category. What would be the best way to do this using MALLET? I've gone through multiple articles about this, but couldn't find a way.
Also, do I need to do sequence tagging on the input text?
First, you need to build a training model from the documents arranged in folders. For Mallet, each folder will contain one or more documents, and each folder will represent their class.
Once you have your training documents, you need to create a file that can be understood by Mallet. Go to the bin folder of Mallet and enter a command like the following on the command line:
mallet import-dir --input directory:\...\parentfolder\* --preserve-case --remove-stopwords --binary-features --gram-sizes 1 --output directory:\mallet-file-name
This is just an example. The full set of parameters for this command can be displayed by typing the following:
mallet import-dir --help
Once you create this Mallet file, you need to train a model with a command such as the following:
mallet train-classifier --trainer algorithmname --input directory:\mallet-file-name --output-classifier directory:...\model
Now that the model is created, you can use that model to classify a document with unknown class.
mallet classify-file --input directory:\...\data --output - --classifier classifier
This will print the class of the document named data on standard output.
Whether you need to use sequence tagging depends on the data that you are trying to classify.
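As a side note, if you later want to do the classification step from Java rather than from the command line, something along these lines should work with Mallet's Classifier API. This is only a sketch, not part of the original answer, and the model file name is a placeholder:

// Rough sketch: load the serialized classifier written by "mallet train-classifier"
// and classify a raw text string through the same pipe it was trained with.
import java.io.File;
import java.io.FileInputStream;
import java.io.ObjectInputStream;
import cc.mallet.classify.Classification;
import cc.mallet.classify.Classifier;
import cc.mallet.types.Instance;
import cc.mallet.types.InstanceList;

public class ClassifyText {
  public static void main(String[] args) throws Exception {
    // "model" is the file produced by --output-classifier (placeholder name)
    ObjectInputStream ois = new ObjectInputStream(new FileInputStream(new File("model")));
    Classifier classifier = (Classifier) ois.readObject();
    ois.close();

    // Reuse the classifier's pipe so the new text goes through the same
    // feature extraction as the training data.
    InstanceList instances = new InstanceList(classifier.getInstancePipe());
    instances.addThruPipe(new Instance("text of the question to classify", null, "query", null));

    Classification result = classifier.classify(instances.get(0));
    System.out.println(result.getLabeling().getBestLabel());
  }
}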

Working with Amino Acids

I am working with a file that contains thousands of proteins in an organism. I have code that will allow me to go through each individual protein one by one and determine the frequency of amino acids in each. Would there be a way to alter my current code to allow me to determine all of the frequencies of amino acids at once?
IIUC, you're reinventing the wheel a bit: BioPython contains utilities for handling files in various formats (FASTA in your case) and for simple analysis. For your example, I'd use something like this:
from Bio import SeqIO
from Bio.SeqUtils.ProtParam import ProteinAnalysis

for seq_record in SeqIO.parse("protein_x.txt", "fasta"):
    # str(seq_record.seq) gives the plain sequence string expected by ProteinAnalysis
    print(seq_record.id, ProteinAnalysis(str(seq_record.seq)).get_amino_acids_percent())
The answer is yes, but without seeing your code we can't give much feedback. Essentially, you want your counts of the amino acids to persist between reading FASTA records. If you want probabilities, total the counts up outside the loop and divide through only at the end. This is trivially accomplished with something like a counting dictionary in Python, or by incrementing a value in a hash/dict. There are also very likely plenty of command-line tools that do this for you, since all you want is character-level counts for every line not starting with '>' in the file.
For example, for a file that small:
grep -v '^>' yourdata.fa | perl -pe 's/(.)/$1\n/g' | sort | uniq -c
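The counting idea itself is language-agnostic; as a hedged illustration (not from the original answer), the same aggregate count looks like this in Java, with the "counting dictionary" being a plain HashMap and proteins.fa a placeholder file name:

// Sketch of the counting-dictionary idea, assuming a plain FASTA file:
// keep one running count per amino acid across every record, then report totals at the end.
import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class AminoAcidCounts {
  public static void main(String[] args) throws Exception {
    Map<Character, Long> counts = new HashMap<Character, Long>();
    BufferedReader reader = new BufferedReader(new FileReader("proteins.fa")); // placeholder
    String line;
    while ((line = reader.readLine()) != null) {
      if (line.startsWith(">")) continue;            // skip FASTA header lines
      for (char c : line.trim().toCharArray()) {
        Long current = counts.get(c);
        counts.put(c, current == null ? 1L : current + 1L);
      }
    }
    reader.close();
    long total = 0;
    for (long n : counts.values()) total += n;
    // print count and overall frequency per residue
    for (Map.Entry<Character, Long> e : counts.entrySet()) {
      System.out.printf("%c\t%d\t%.4f%n", e.getKey(), e.getValue(), e.getValue() / (double) total);
    }
  }
}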

Stanford NLP - Using Parsed or Tagged text to generate Full XML

I'm trying to extract data from the PennTreeBank, Wall Street Journal corpus. Most of it already has the parse trees, but some of the data is only tagged.
i.e. wsj_DDXX.mrg and wsj_DDXX.pos files.
I would like to use the already parsed trees and tagged data in these files so as not to use the parser and taggers within CoreNLP, but I still want the output file format that CoreNLP gives; namely, the XML file that contains the dependencies, entity coreference, and the parse tree and tagged data.
I've read many of the Javadocs, but I cannot figure out how to do it the way I described.
For POS, I tried using the LexicalizedParser, and it allows me to use the tags, but I can only generate an XML file with some of the information I want; there is no option for coreference or for generating the parse trees. To get it to correctly generate even these sub-optimal XML files, I had to write a script to remove all of the brackets within the files. This is the command I use:
java -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat typedDependenciesCollapsed,wordsAndTags -outputFilesExtension xml -outputFormatOptions xml -writeOutputFiles -outputFilesDirectory my\dir -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz my\wsj\files\dir
I also can't generate the data I would like to have for the WSJ data that already has the trees. I tried using what is said here and looked at the corresponding Javadocs. I used a command similar to what is described, but I had to write a Python program to retrieve the stdout data resulting from analyzing each file and write it into a new file. The resulting data is only a text file with the dependencies and is not in the desired XML notation.
To summarize, I would like to use the POS and tree data from these PTB files in order to generate a CoreNLP parse corresponding to what would occur if I used CoreNLP on a regular text file. The pseudo-command would look like this:
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -useTreeFile wsj_DDXX.mrg
and
java -cp "*" edu.stanford.nlp.pipeline.CoreNLP -usePOSFile wsj_DDXX.pos
Edit: fixed a link.
Yes, this is possible, but it is a bit tricky and there is no out-of-the-box feature that can do it, so you will have to write some code. The basic idea is to replace the tokenize, ssplit and pos annotators (and, in case you also have trees, the parse annotator) with your own code that loads these annotations from your annotated files.
On a very high level you have to do the following:
1) Load your trees with MemoryTreebank.
2) Loop through all the trees and, for each tree, create a sentence CoreMap to which you add a TokensAnnotation, a TreeAnnotation and the SemanticGraphCoreAnnotations.
3) Create an Annotation object with a list containing the CoreMap objects for all sentences.
4) Run the StanfordCoreNLP pipeline with the annotators option set to lemma,ner,dcoref and the option enforceRequirements set to false.
Take a look at the individual annotators to see how to add the required annotations. E.g. there is a method in ParserAnnotatorUtils that adds the SemanticGraphCoreAnnotations.
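To make those steps concrete, here is a very rough sketch. The file path is a placeholder, the SemanticGraph step is left as a comment because the exact ParserAnnotatorUtils method signature differs between CoreNLP releases, and none of this is guaranteed to be the exact code the answer has in mind:

// Rough sketch of steps 1-4 above (paths and details are assumptions).
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.trees.MemoryTreebank;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations;
import edu.stanford.nlp.util.ArrayCoreMap;
import edu.stanford.nlp.util.CoreMap;

public class TreesToCoreNlp {
  public static void main(String[] args) {
    // 1) Load the gold trees from a .mrg file
    MemoryTreebank treebank = new MemoryTreebank();
    treebank.loadPath("wsj_0001.mrg");                       // placeholder path

    // 2) Build one sentence CoreMap per tree
    List<CoreMap> sentences = new ArrayList<CoreMap>();
    for (Tree tree : treebank) {
      CoreMap sentence = new ArrayCoreMap();
      List<CoreLabel> tokens = tree.taggedLabeledYield();    // words plus gold POS tags
      sentence.set(CoreAnnotations.TokensAnnotation.class, tokens);
      sentence.set(TreeCoreAnnotations.TreeAnnotation.class, tree);
      // Add the SemanticGraphCoreAnnotations here, e.g. via the
      // ParserAnnotatorUtils method mentioned above (signature varies by version).
      sentences.add(sentence);
    }

    // 3) Wrap the sentences in an Annotation object
    Annotation document = new Annotation("");
    document.set(CoreAnnotations.SentencesAnnotation.class, sentences);

    // 4) Run only the downstream annotators on the pre-filled annotation
    Properties props = new Properties();
    props.setProperty("annotators", "lemma,ner,dcoref");
    props.setProperty("enforceRequirements", "false");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    pipeline.annotate(document);
  }
}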

Mahout: Importing CSV file to Sequence Files using regexconverter or arff.vector

I just started learning how to use mahout. I'm not a java programmer however, so I'm trying to stay away from having to use the java library.
I noticed there is a shell tool, regexconverter. However, the documentation is sparse and not very instructive. Exactly what does specifying a regex option do, and what do the transformer class and formatter class do? The Mahout wiki is marvelously opaque. I'm assuming the regex option specifies what counts as a "unit" or so.
The example they list, I believe, uses regexconverter to convert HTTP log requests to sequence files. I have a CSV file with slightly altered HTTP log requests that I'm hoping to convert to sequence files. Do I simply change the regular expression to capture each entire row? I'm trying to run a Bayes classifier, similar to the 20 newsgroups example, which seems to be done completely in the shell without any need for Java coding.
Incidentally, the arff.vector command seems to allow me to convert an ARFF file directly to vectors. I'm unfamiliar with ARFF, though it seems to be something I can easily convert CSV log files into. Should I use this method instead, and skip the sequence file step completely?
Thanks for the help.

What is appropriate for me: generateAllGrams(), or is generateCollocations() enough?

I am developing a WordNet-based document summarizer, and in it I need to extract collocations. I tried to research as much as I could, but since I have not worked with Mahout before, I am having difficulty understanding how CollocDriver.java works (in an API context).
While scouring the web, I landed on this:
Mahout Collocations
This is the problem: I have POS-tagged input text and I need to identify the collocations in it. I have the CollocDriver.java code; now I need to know how to use it. Is the generateAllGrams() method needed, or is the generateCollocations() method enough for my subtask within the summarizer?
And, most importantly, HOW do I use it? I raise this question because, I admit, I don't know the API well.
I also found a grepcode version of CollocDriver; the two implementations seem to be slightly different. The inputs are Strings in the grepcode version and Path objects in the original.
My questions: what is the Configuration object in the input parameters, and how do I use it? Will the source/destination be a String (as in grepcode) or a Path (as in the original)?
What will the output be?
I have done some further R&D on the CollocDriver program and found out that it uses a sequence file and then vector generation. I want to know how this sequence file / vector generation works. Please help.
To get collocations using Mahout, you need to follow some simple steps:
1) You must make a sequence file from your input text file.
/bin/mahout seqdirectory -i /home/developer/Desktop/colloc/ -o /home/developer/Desktop/colloc/test-seqdir -c UTF-8 -chunk 5
2) There are two ways to generate collocations from a sequence file:
a) Convert the sequence file to sparse vectors and find the collocations from there.
b) Find the collocations directly from the sequence file (without creating the sparse vectors).
3) Here I am considering choice b.
/bin/mahout org.apache.mahout.vectorizer.collocations.llr.CollocDriver -i /home/developer/Desktop/colloc/test-seqdir -o /home/developer/Desktop/colloc/test-colloc -a org.apache.mahout.vectorizer.DefaultAnalyzer -ng 3 -p
Just check the output folder; the files you need are there (in sequence file format).
/bin/mahout seqdumper -s /home/developer/Desktop/colloc/test-colloc/ngrams/part-r-00000 >> out.txt will give you a text output.
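If you would rather drive this from Java than from the shell, CollocDriver is a Hadoop job (it extends AbstractJob, which implements Hadoop's Tool interface), so one hedged option is to pass it the same arguments through ToolRunner. This is only a sketch, and the paths simply mirror the command above:

// Hedged sketch: invoke CollocDriver programmatically with the same arguments
// as the shell command in step 3. Paths are assumptions taken from that command.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.collocations.llr.CollocDriver;

public class RunCollocDriver {
  public static void main(String[] args) throws Exception {
    String[] jobArgs = {
        "-i", "/home/developer/Desktop/colloc/test-seqdir",
        "-o", "/home/developer/Desktop/colloc/test-colloc",
        "-a", "org.apache.mahout.vectorizer.DefaultAnalyzer",
        "-ng", "3",
        "-p"                                  // preprocess the raw text input first
    };
    // ToolRunner parses the generic Hadoop options and calls CollocDriver.run()
    int exitCode = ToolRunner.run(new Configuration(), new CollocDriver(), jobArgs);
    System.exit(exitCode);
  }
}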
