I'm using mahout 0.7 on a pseudo-distributed hadoop installation for testing purposes.
A lot of what I'm doing is being guided by Mahout in Action, which I know deals with 0.5, but as far as I can tell, nothing major has changed with seq2sparse.
I'm having a problem with the tfidf vectors generated by seq2sparse. No matter what I set "-x" (max document frequency percentage) to, I end up with the same number of terms in my dictionary, and vectors of the same size.
I found one posting about mahout 0.6 where -x was being parsed as an absolute number of documents rather than a percentage of documents. That was supposed to have been fixed in 0.7, but I tried using it that way too, just to see if it would help. No change in the number of terms I'm getting. Here are the values I've tried and the number of terms I ended up with. My data set is 4850 Wikipedia articles from: http://dumps.wikimedia.org/enwiki/20110803/
The exact file is: pages-articles1.xml.bz2
The xml file was turned into a seqfile with:
mahout seqwiki -all -i <path to xml file> -o <path to output directory>
My calls to seq2sparse look like this:
mahout seq2sparse -i <seq directory> -o <out dir> -ow -wt tfidf -x 4800 -nv
My results:
| -x value | # of terms |
| 4800     | 256623     |
| 4600     | 256623     |
| 2500     | 256623     |
| 99       | 256623     |
| 90       | 256623     |
| 25       | 256623     |
| 5        | 256623     |
Any ideas on what I'm doing wrong?
I ended up asking this question on the mahout user mailing list and got an answer. I'll reproduce it here for anybody wondering the same thing I was:
Dave Byrne - "maxDFPercent won't actually remove the terms from the dictionary, or reduce the size of the tfidf vectors. It simply sets the value of the vector to 0 for that term.
In other words, the dictionary size and vector length will remain the same, with fewer non-zero terms."
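A toy illustration of that distinction (plain Python, nothing Mahout-specific; the terms and weights are made up):

# Toy sketch, not Mahout code: pruning a high-DF term zeroes its weight
# but does not shrink the dictionary or the vector.
dictionary = ["the", "mahout", "hadoop", "wikipedia"]   # dictionary size stays 4
tfidf = [0.9, 0.4, 0.0, 0.2]                            # vector length stays 4

tfidf[0] = 0.0  # "the" exceeds the max document frequency, so its weight is zeroed
print(len(dictionary), len(tfidf))        # still 4 and 4
print(sum(1 for w in tfidf if w != 0.0))  # but only 2 non-zero terms remain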
I am given a list of IDs which I need to trace back to a name in a file.
The ID file contains:
1
2
3
4
5
6
The IDs are contained in a large 2 GB file called result.txt:
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
So I cat the ID file into a variable
I then use this variable in a loop, using grep and cut -d on result.txt to pull out the codes that link back to the name, and store the output in a variable,
so the variable contains ABC CDE FG1.
In the same loop I pass the output of that grep to another grep on result.txt, to get the name,
i.e. re-grep the file for ABC CDE FG1.
I do get the answer, but it takes a long time. Is there a more efficient way?
Thanks
Making some assumptions about your requirement... IDs that are not found in the big file will not be shown in the output; the desired output is in the format shown below.
Here are mock input files - f1 for the id's and f2 for the large file:
[mathguy@localhost test]$ cat f1
1
2
3
4
5
6
[mathguy@localhost test]$ cat f2
ABC=John,dhds,72828,73737,3939,92929
CDE=John,uubad,32424,ajdaio,343533
FG1=Peter,iasisaio,097282,iosoido
WER=Ann,97391279,89719379,7391739
result,**id=1**,iuhdihdio,ihwoihdoih,iuqhwiuh,ABC
result2,**id=2**,9729179,hdqihi,hidqi,82828,CDE
result3,**id=3**,biasi,8u9829,90u209w,jswjso,FG1
Proposed solution and output:
[mathguy@localhost test]$ sed 's/.*/\*\*id=&\*\*/' f1 | grep -Ff - f2 | \
> sed -E 's/^.*\*\*id=([[:digit:]]*)\*\*.*,([^,]*)$/\1 \2/'
1 ABC
2 CDE
3 FG1
The hard work here is done by grep -F which might be just fast enough for your needs. There is some prep work and some clean-up work done by sed, but those are both on small datasets.
First we take the ids from the input file and output strings in the format **id=<number>**. That output is fed to grep -F as fixed-string patterns via the option -f (take the patterns from a file, in this case from stdin, invoked as -; that is, from the output of sed).
After we find the needed lines from the big file, the final sed just extracts the id and the name from each line.
Note: this assumes that each id is only found once in the big file. (Actually the command will work regardless; but if there are duplicate lines for an id, your business users will have to tell you how to handle them. What if you get contradictory names for the same id? Etc.)
I tried to use grep with "\b" to search for lines containing the word "bead", but it doesn't find the lines where "bead" is separated by a space. I tried this script:
cat in.txt | grep -i "\bbead\b" > out.txt
I get results like
BEAD-air.JPG
Bead, 3 sided MET DP110317.jpg
Bead. -2819 (FindID 10143).jpg
Bead(Gem), Artefacts of Phu Hoa site(Dong Nai province).jpg
Romano-British pendant amulet (bead) (FindID 241983).jpg
But I don't get results like
Bead fun.jpg
Instead of getting some 2,000 lines, I'm only getting 92 lines
My OS is Windows 10 - 64 bit but I'm using grep 2.5.4 from the GnuWin32 package.
I've also tried the MSYS2, which includes grep 3.0 but it does the same thing.
And then, how can I search for words separated by space?
LATER EDIT:
It looks like grep has problems with big files. My input file is 2.4 GB in size. With smaller files, it works - I reported the bug here: https://sourceforge.net/p/getgnuwin32/discussion/554300/thread/03a84e6b/
Try this,
cat in.txt | grep -wi "bead"
-w gives you a whole-word search.
What you are doing should normally work, but there are ways of setting what is and is not considered a word boundary. Rather than worry about it, please try this instead:
cat in.txt | grep -iP "\bbead(\b|\s)" > out.txt
The -P option adds Perl regular expression power, and \s matches any sort of space character. The or-bar | separates alternatives within the parens ( ).
While you are waiting for grep to be fixed you could use another tool if it is available to you. E.g.
perl -lane 'print if (m/\bbead\b/i);' in.txt > out.txt
I am looking for a solution to extract the list of concepts that a text (or html) document is about. I'd like the concepts to be wikidata topics (or freebase or DBpedia).
For example "Bad is a song by Mikael Jackson" should return Michael Jackson (the artist, wikidata Q2831) and Bad (the song, wikidata Q275422). As this example shows, the system should be robust to spelling mistakes (Mikael) and ambiguity (Bad).
Ideally the system should work across multiple languages, it should work both on short texts and long texts, and when it is unsure it should return multiple topics (eg. Bad song + Bad album). Also, it should ideally be open source and have a python API.
Yes, that sounds like a list for Santa Claus. Any ideas?
Edit
I checked out a few solutions, but no silver bullet so far.
NLTK parses text and extracts "named entities" (AFAIU, a part of a sentence that refers to a name), but it does not return Wikidata topics, just plain text (see the sketch after this list). This means that it will likely not understand that "I shot the sheriff" is the name of a song by Bob Marley; it will instead treat this as a sentence.
OpenNLP does roughly the same.
Wikidata has a search API, but it's just one term at a time, and it does not handle disambiguation.
There are a few commercial services (OpenCalais, AlchemyAPI, CogitoAPI...) but none really shines, IMHO.
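For illustration, here is a minimal NLTK sketch of what I mean (it assumes the punkt, averaged_perceptron_tagger, maxent_ne_chunker and words data have been downloaded). It labels spans like PERSON, but returns plain text, not Wikidata IDs:

import nltk

sentence = "I shot the sheriff is a song by Bob Marley"
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))

for subtree in tree.subtrees():
    if subtree.label() in ("PERSON", "ORGANIZATION", "GPE"):
        print(subtree.label(), " ".join(word for word, tag in subtree.leaves()))
# Typically prints something like: PERSON Bob Marley
# Nothing here links "I shot the sheriff" to any Wikidata entry.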
You can use spaCy to retrieve named entities, then link them to Wikidata using the search API.
For whatever part of the sentence is not matched as a named entity by spaCy, you can create a list of n-grams from the sentence and, starting with the biggest n-gram, use the Wikidata search API to look up Wikidata topics.
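A minimal sketch of the first step, assuming the en_core_web_sm spaCy model is installed and using the public wbsearchentities endpoint (no n-gram fallback, no error handling):

# Sketch: spaCy NER + Wikidata search API (wbsearchentities).
# Assumes: pip install spacy requests, plus the en_core_web_sm model.
import requests
import spacy

nlp = spacy.load("en_core_web_sm")

def wikidata_candidates(term, language="en", limit=3):
    # wbsearchentities does label/prefix matching; it will not fix every spelling mistake
    params = {
        "action": "wbsearchentities",
        "search": term,
        "language": language,
        "format": "json",
        "limit": limit,
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params)
    return [(hit["id"], hit.get("description", "")) for hit in r.json().get("search", [])]

doc = nlp("Bad is a song by Mikael Jackson")
for ent in doc.ents:
    print(ent.text, wikidata_candidates(ent.text))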
POS tagging can be put to good use; that said, syntax parse information is more powerful, since you can know the relations between the words. For instance, given the following output from link-grammar:
Found 8 linkages (8 had no P.P. violations)
Linkage 1, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Osn--------+ |
| +---G---+----Ss---+----Os----+ | |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
You can tell that the subject is “Bob Marley” because
“wrote” is connected to “Marley” with an S, which connects subject nouns to finite verbs.
“Marley” is connected to “Bob” with a G, which connects proper nouns together.
So “Bob Marley” is a good candidate for an entity (also, both words are capitalized).
Given the above parse "tree" it is difficult to tell whether “Natural” and “Mystic” are related, even if they are on the same side of the sentence.
The second parse provided by link-grammar has the same cost vector and links “Natural Mystic” together, again with a G.
Here it is:
Linkage 2, cost vector = (UNUSED=0 DIS= 0.15 LEN=9)
+-------------------------Xp-------------------------+
+----------->WV---------->+ |
+-------Wd------+ +---------Os---------+ |
| +---G---+----Ss---+ +----G----+ |
| | | | | | |
LEFT-WALL Bob.m Marley[!] wrote.v-d Natural[!] Mystic[!] .
So in my opinion “Bob Marley” and “Natural Mystic” are good candidates for a Wikidata search.
That was the easy problem, where grammar and spelling are correct.
Here is one parse out of 11 of the same sentence in lower case:
Linkage 1, cost vector = (UNUSED=1 DIS= 0.15 LEN=14)
+------------------------Xp------------------------+
+----------------------Wa---------------------+ |
| +------------------AN-----------------+ |
| | +-------------AN-------------+ |
| | | +----AN---+ |
| | | | | |
LEFT-WALL Bob.m marley[?].n [wrote] natural.n mystic.n .
LG doesn't even recognize the verb.
I run a command that produces lots of lines in my terminal - the lines are floats.
I only want certain numbers to be output as a line in my terminal.
I know that I can pipe the results to egrep:
| egrep "(369|433|375|368)"
if I want only certain values to appear. But is it possible to show only lines with a value within ±50 of 350 (for example)?
grep matches against string tokens, so you have to either:
figure out the right string match for the number range you want (e.g., for 300-400, you might do something like grep -E [34].., with appropriate additional context added to the expression and a number of additional .s equal to your floating-point precision)
convert the number strings to actual numbers in whatever programming language you prefer to use and filter them that way
I'd strongly encourage you to take the second option.
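For example, a small Python filter along those lines (the 350 ± 50 window is taken from your question; adjust as needed):

#!/usr/bin/env python3
# Sketch of the second option: read lines from stdin, treat each as a number,
# and keep it only if it falls within +/-50 of 350. Non-numeric lines are skipped.
import sys

CENTER, TOLERANCE = 350.0, 50.0

for line in sys.stdin:
    try:
        value = float(line.strip())
    except ValueError:
        continue
    if abs(value - CENTER) <= TOLERANCE:
        print(line, end="")

Pipe your command into it, e.g. ./yourProgram | python3 filter_range.py (the script name is just an example).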
I would go with awk here:
./yourProgram | awk '$1>250 && $1<350'
e.g.
echo -e "12.3\n342.678\n287.99999" | awk '$1>250 && $1<350'
342.678
287.99999
Command:
grep -oP '(?<=\"name\":\")[^"]*|(?<=\"title\":\")[^"]*' *.json >newjson
The output I'm getting:
10XANY10G_1.json:chMax
10XANY10G_1.json:Max Frequency in GHz
10XANY10G_1.json:up
10XANY10G_1.json:UP
10XANY10G_1.json:down
10XANY10G_1.json:DOWN
10XANY10G_1.json:CapabilityList
10XANY10G_1.json:Capabilities
10XANY10G_1.json:encoding
10XANY10G_1.json:Encoding
Expected output:
chMax:"Max Frequency in GHz",
up:"UP",
down:"DOWN",
Contents of the file:
{"card":{"cardName":"10AN10G","portSignalRates":["10AN10G-1-OTU2","10AN10G-1-OTU2E","10AN10G-1-TENGIGE","10AN10G-1-STM64"],"listOfPort":{"10AN10G-1-OTU2":{"portAid":"10AN10G-1-OTU2","signalType":"OTU2","tabNames":["PortDetails"],"requestType":{"PortDetails":"PTP"},"paramDetailsMap":{"PortDetails":[{"type":"dijit.form.TextBox","name":"signalType","title":"Signal Rate","id":"","options":[],"label":"","value":"OTU2","checked":"","enabled":"false","selected":""},{"type":"dijit.form.TextBox","name":"userLabel","title":"Description","id":"","options":[],"label":"","value":"","checked":"","enabled":"true","selected":""},{"type":"dijit.form.Select","name":"Frequency","title":"Transmit Frequency",}}}}}}
I think you're looking for this,
$ grep -oP '(?<=\"name\":\")[^"]*|(?<=\"title\":)[^,]*' file
signalType
"Signal Rate"
userLabel
"Description"
Frequency
"Transmit Frequency"
To get the desired output
$ grep -oP '(?<=\"name\":\")[^"]*|(?<=\"title\":)[^,]*' file | paste -d: - -
signalType:"Signal Rate"
userLabel:"Description"
Frequency:"Transmit Frequency"
I think that your problem is that you are ORing the two groups with |. Try removing the | and you will get closer to what you are looking for, but you may have to add a term to skip any intervening tags for cases where name and title are not immediately after each other; then you might have to get clever to deal with the case that has an entry with a name but no title.
As said, grep is not the best tool for parsing JSON; there are numerous others. Personally, I would suggest using Python and the json library to load your file and then output the tags that you need, as in the sketch below.
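A rough sketch of that idea, assuming the file parses as valid JSON (the snippet above looks truncated, so it may need the complete file); the file name is just an example:

import json

# Walk the parsed JSON and print name:"title" pairs wherever both keys
# appear in the same object, matching the output format you described.
def walk(node):
    if isinstance(node, dict):
        if "name" in node and "title" in node:
            print('%s:"%s",' % (node["name"], node["title"]))
        for value in node.values():
            walk(value)
    elif isinstance(node, list):
        for item in node:
            walk(item)

with open("10XANY10G_1.json") as f:
    walk(json.load(f))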
Here is a GNU awk (due to RS) version to extract the data:
awk -F\" '/title/ {print $3":"$7}' RS='name' file
signalType:Signal Rate
userLabel:Description
Frequency:Transmit Frequency