I am trying to follow the example on using LDA on the Reuters data as indicated in the Mahout In Action book. However, regardless of the number of times I run it, I always get only one topic.
I ran the command as indicated:
mahout lda -i reuters-vectors/tf-vectors -o reuters-lda-sparse -k 10 -v 34262 -x 20 -ow
I got the number from running seqdumper. After the command has run, I run the LDAPrintTopics as indicated in the book and get the following:
Topic 0
===========
billion [p(billion|topic_0) = 0.04580929884162013
pct [p(pct|topic_0) = 0.043323700764985575
dlrs [p(dlrs|topic_0) = 0.031395871939373196
3 [p(3|topic_0) = 0.027311386657272094
1987 [p(1987|topic_0) = 0.025690077982656934
1 [p(1|topic_0) = 0.022727304049111215
reuter [p(reuter|topic_0) = 0.019572283708227903
mln [p(mln|topic_0) = 0.014569551610736616
april [p(april|topic_0) = 0.014453636611524965
march [p(march|topic_0) = 0.014359948846622552
Is there way to get more topics out of LDA?
Thanks.
Your command says -k 10 which specifies that there need to be 10 topics.
See this https://cwiki.apache.org/MAHOUT/latent-dirichlet-allocation.html
Try changing your data set, may be its too small to generate 10 different topics
Related
I am using coverage.py tool to get coverage of python code. If I use the command without --branch flag like below,
coverage run test_cmd
I get the coverage report like this,
Name Stmts Miss Cover
--------------------------------------------------------------------------------------------------------------------------
/path/file.py 9 2 78%
From this I understand that the cover percentage value is derived as this
cover = (Stmts Covered/total stmts)*100 = (9-2/9)*100 = 77.77%
But when I run coverage with --branch flag enabled like this
coverage run --branch test_cmd
I get the coverage report like this,
Name Stmts Miss Branch BrPart Cover
----------------------------------------------------------------------------------------------------------------------------------------
/path/file.py 9 2 2 1 73%
From this report I am not able to understand the formula used to get Cover=73%
How is this number coming and is this correct value for code coverage?
While you probably shouldn't worry too much about the exact numbers, here is how they are calculated:
Reading from the output you gave, we have:
n_statements = 9
n_missing = 2
n_branches = 2
n_missing_branches = 1
Then it is calculated (I simplified the syntax a bit, in the real code everything is hidden behind properties.)
https://github.com/nedbat/coveragepy/blob/a9d582a47b41068d2a08ecf6c46bb10218cb5eec/coverage/results.py#L205-L207
n_executed = n_statements - n_missing
https://github.com/nedbat/coveragepy/blob/8a5049e0fe6b717396f2af14b38d67555f9c51ba/coverage/results.py#L210-L212
n_executed_branches = n_branches - n_missing_branches
https://github.com/nedbat/coveragepy/blob/8a5049e0fe6b717396f2af14b38d67555f9c51ba/coverage/results.py#L259-L263
numerator = n_executed + n_executed_branches
denominator = n_statements + n_branches
https://github.com/nedbat/coveragepy/blob/master/coverage/results.py#L215-L222
pc_covered = (100.0 * numerator) / denominator
Now put all of this in a script, add print(pc_covered) and run it:
72.72727272727273
It is then rounded of course, so there you have your 73%.
I am wondering if you would be able to give advice about defining a Snakemake rule to combine over one, but not all wildcards? My data is organized so that I have runs and samples; most, but not all samples, were resequenced in every run. Therefore, I have pre-processing steps that are per-sample-run. Then, I have a step that combines BAM files for each run per-sample. However, the issue I'm running into is that I'm a bit confused how to define a rule so that I can list an input of all indivudal bams (from different runs) corresponding to a sample.
I'm putting my entire pipeline below, for clarity, but my real question is on rule combine_bams. How can I list all bams for a single sample in the input?
Any suggestions would be great! Thank you very much in advance!
# Define samples and runs
RUNS, SAMPLES = glob_wildcards("/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz")
print("runs are: ", RUNS)
print("samples are: ", SAMPLES)
rule all:
input:
#trim = ['process/trim/{run}_{samp}_trim_1.fq.gz'.format(samp=sample_id, run=run_id) for sample_id, run_id in zip(sample_ids, run_ids)],
trim = expand(['process/trim/{run}_{samp}_trim_1.fq.gz'], zip, run = RUNS, samp = SAMPLES),
kraken=expand('process/trim/{run}_{samp}_trim_kr_1.fq.gz', zip, run = RUNS, samp = SAMPLES),
bams=expand('process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam', zip, run = RUNS, samp = SAMPLES), # add fixed ref/mapper (expand with zip doesn't allow these to repeate)
combined_bams=expand('process/bams/{samp}_bwa_MTB_ancestor_reference.merged.rmdup.bam', samp = np.unique(SAMPLES))
# Trim reads for quality.
rule trim_reads:
input:
p1='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R1_001.fastq.gz', # update inputs so they only include those that exist use zip.
p2='/labs/jandr/walter/tb/data/Stanford/{run}/{samp}_L001_R2_001.fastq.gz'
output:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
log:
'process/trim/{run}_{samp}_trim_reads.log'
shell:
'/labs/jandr/walter/tb/scripts/trim_reads.sh {input.p1} {input.p2} {output.trim1} {output.trim2} &>> {log}'
# Filter reads taxonomically with Kraken.
rule taxonomic_filter:
input:
trim1='process/trim/{run}_{samp}_trim_1.fq.gz',
trim2='process/trim/{run}_{samp}_trim_2.fq.gz'
output:
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz',
kraken_stats='process/trim/{run}_{samp}_kraken.report'
log:
'process/trim/{run}_{samp}_run_kraken.log'
threads: 8
shell:
'/labs/jandr/walter/tb/scripts/run_kraken.sh {input.trim1} {input.trim2} {output.kr1} {output.kr2} {output.kraken_stats} &>> {log}'
# Map reads.
rule map_reads:
input:
ref_path='/labs/jandr/walter/tb/data/refs/{ref}.fasta.gz',
kr1='process/trim/{run}_{samp}_trim_kr_1.fq.gz',
kr2='process/trim/{run}_{samp}_trim_kr_2.fq.gz'
output:
bam='process/bams/{run}_{samp}_{mapper}_{ref}_rg_sorted.bam'
params:
mapper='{mapper}'
log:
'process/bams/{run}_{samp}_{mapper}_{ref}_map.log'
threads: 8
shell:
"/labs/jandr/walter/tb/scripts/map_reads.sh {input.ref_path} {params.mapper} {input.kr1} {input.kr2} {output.bam} &>> {log}"
# Combine reads and remove duplicates (per sample).
rule combine_bams:
input:
bams = 'process/bams/{run}_{samp}_bwa_MTB_ancestor_reference_rg_sorted.bam'
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
Create a dictionary to associate each sample with its list of runs.
Then for the combine_bams rule, use an input function to generate the input files for that sample using the dictionary.
rule combine_bams:
input:
bams = lambda wildcards: expand('process/bams/{run}_{{samp}}_bwa_MTB_ancestor_reference_rg_sorted.bam', run=sample_dict[wildcards.sample])
output:
combined_bam = 'process/bams/{samp}_{mapper}_{ref}.merged.rmdup.bam'
log:
'process/bams/{samp}_{mapper}_{ref}_merge_bams.log'
threads: 8
shell:
"sambamba markdup -r -p -t {threads} {input.bams} {output.combined_bam}"
I'm trying to set up a dataflow streaming pipeline in python. I have quite some experience with batch pipelines. Our basic architecture looks like this:
The first step is doing some basic processing and takes about 2 seconds per message to get to the windowing. We are using sliding windows of 3 seconds and 3 second interval (might change later so we have overlapping windows). As last step we have the SOG prediction that takes about 15ish seconds to process and which is clearly our bottleneck transform.
So, The issue we seem to face is that the workload is perfectly distributed over our workers before the windowing, but the most important transform is not distributed at all. All the windows are processed one at a time seemingly on 1 worker, while we have 50 available.
The logs show us that the sog prediction step has an output once every 15ish seconds which should not be the case if the windows would be processed over more workers, so this builds up huge latency over time which we don't want. With 1 minute of messages, we have a latency of 5 minutes for the last window. When distribution would work, this should only be around 15sec (the SOG prediction time). So at this point we are clueless..
Does anyone see if there is something wrong with our code or how to prevent/circumvent this?
It seems like this is something happening in the internals of google cloud dataflow. Does this also occur in java streaming pipelines?
In batch mode, Everything works fine. There, one could try to do a reshuffle to make sure no fusion etc occurs. But that is not possible after windowing in streaming.
args = parse_arguments(sys.argv if argv is None else argv)
pipeline_options = get_pipeline_options(project=args.project_id,
job_name='XX',
num_workers=args.workers,
max_num_workers=MAX_NUM_WORKERS,
disk_size_gb=DISK_SIZE_GB,
local=args.local,
streaming=args.streaming)
pipeline = beam.Pipeline(options=pipeline_options)
# Build pipeline
# pylint: disable=C0330
if args.streaming:
frames = (pipeline | 'ReadFromPubsub' >> beam.io.ReadFromPubSub(
subscription=SUBSCRIPTION_PATH,
with_attributes=True,
timestamp_attribute='timestamp'
))
frame_tpl = frames | 'CreateFrameTuples' >> beam.Map(
create_frame_tuples_fn)
crops = frame_tpl | 'MakeCrops' >> beam.Map(make_crops_fn, NR_CROPS)
bboxs = crops | 'bounding boxes tfserv' >> beam.Map(
pred_bbox_tfserv_fn, SERVER_URL)
sliding_windows = bboxs | 'Window' >> beam.WindowInto(
beam.window.SlidingWindows(
FEATURE_WINDOWS['goal']['window_size'],
FEATURE_WINDOWS['goal']['window_interval']),
trigger=AfterCount(30),
accumulation_mode=AccumulationMode.DISCARDING)
# GROUPBYKEY (per match)
group_per_match = sliding_windows | 'Group' >> beam.GroupByKey()
_ = group_per_match | 'LogPerMatch' >> beam.Map(lambda x: logging.info(
"window per match per timewindow: # %s, %s", str(len(x[1])), x[1][0][
'timestamp']))
sog = sliding_windows | 'Predict SOG' >> beam.Map(predict_sog_fn,
SERVER_URL_INCEPTION,
SERVER_URL_SOG )
pipeline.run().wait_until_finish()
In beam the unit of parallelism is the key--all the windows for a given key will be produced on the same machine. However, if you have 50+ keys they should get distributed among all workers.
You mentioned that you were unable to add a Reshuffle in streaming. This should be possible; if you're getting errors please file a bug at https://issues.apache.org/jira/projects/BEAM/issues . Does re-windowing into GlobalWindows make the issue with reshuffling go away?
It looks like you do not necessarily need GroupByKey because you are always grouping on the same key. Instead you could maybe use CombineGlobally to append all the elements inside the window in stead of the GroupByKey (with always the same key).
combined = values | beam.CombineGlobally(append_fn).without_defaults()
combined | beam.ParDo(PostProcessFn())
I am not sure how the load distribution works when using CombineGlobally but since it does not process key,value pairs I would expect another mechanism to do the load distribution.
Okay, this is a Windows specific question.
I need to be able to access the ink levels of a printer connected to a computer. Possibly direct connection, or a network connection.
I recognize that it will likely be different for each printer (or printer company at least) but where can I find the information of how they reveal ink levels to a PC. Also, what is the best language to read this information in?
Okay, this is a OS agnostic answer... :-)
If the printer isn't a very cheapo model, it will have built-in support for SNMP (Simple Network Management Protocol). SNMP queries can return current values from the network devices stored in their MIBs (Management Information Bases).
For printers there's a standard defined called Printer MIB. The Printer MIB defines standard names and tree locations (OIDs == Object Identifiers in ASN.1 notation) for prtMarkerSuppliesLevel which in the case of ink marking printers map to ink levels.
Be aware that SNMP also allows private extensions to the standard MIBs. Most printer vendors do hide many additional pieces of information in their "private MIBs", though the standard info should always be available through the queries of the Printer MIB OIDs.
Practically every programming language has standard libraries which can help you to make specific SNMP queries from your own application.
One such implementation is Open Source, called Net-SNMP, which also comes with a few powerfull commandline tools to run SNMP queries.
I think the OID to query all levels for all inks is .1.3.6.1.2.1.43.11.1.1.9 (this webpage confirms my believe) but I cannot verify that right now, because I don't have a printer around in my LAN at the moment. So Net-SNMP's snmpget command to query ink levels should be something like:
snmpget \
-c public \
192.168.222.111 \
".1.3.6.1.2.1.43.11.1.1.9"
where public is the standard community string and 192.168.222.111 your printer's IP address.
I have an SNMP-capable HP 8600 pro N911a around to do some digging, so the following commands may help you a bit. Beware that this particular model has some firmware problems, you can't query "magenta" with snmpget, but you see a value with snmpwalk (which does some kind of recursive drill-down).
OLD: You can query the names and sequence of values, but I couldn't find the "max value" to calculate a clean percentage so far ;(. I'm guessing so far the values are relative to 255, so dividing by 2.55 yields a percentage.
Update: Marcelo's hint was great! From Registers .8.* you can read the max level per cartridge, and I was totally wrong assuming the max value can only be an 8-bit value. I have updated the sample script to read the max values and calculate c
There is also some discussion over there at Cacti forums.
One answer confirms that the ink levels are measured as percent (value 15 is "percent" in an enumeration):
# snmpwalk -v1 -c public 192.168.100.173 1.3.6.1.2.1.43.11.1.1.7
SNMPv2-SMI::mib-2.43.11.1.1.7.0.1 = INTEGER: 15
SNMPv2-SMI::mib-2.43.11.1.1.7.0.2 = INTEGER: 15
SNMPv2-SMI::mib-2.43.11.1.1.7.0.3 = INTEGER: 15
SNMPv2-SMI::mib-2.43.11.1.1.7.0.4 = INTEGER: 15
You need to install the net-snmp package. If you're not on Linux you might need some digging for SNMP command line tools for your preferred OS.
# snmpwalk -v1 -c public 192.168.100.173 1.3.6.1.2.1.43.11.1.1.6.0
SNMPv2-SMI::mib-2.43.11.1.1.6.0.1 = STRING: "black ink"
SNMPv2-SMI::mib-2.43.11.1.1.6.0.2 = STRING: "yellow ink"
SNMPv2-SMI::mib-2.43.11.1.1.6.0.3 = STRING: "cyan ink"
SNMPv2-SMI::mib-2.43.11.1.1.6.0.4 = STRING: "magenta ink"
# snmpwalk -v1 -c public 192.168.100.173 1.3.6.1.2.1.43.11.1.1.9.0
SNMPv2-SMI::mib-2.43.11.1.1.9.0.1 = INTEGER: 231
SNMPv2-SMI::mib-2.43.11.1.1.9.0.2 = INTEGER: 94
SNMPv2-SMI::mib-2.43.11.1.1.9.0.3 = INTEGER: 210
SNMPv2-SMI::mib-2.43.11.1.1.9.0.4 = INTEGER: 174
# snmpwalk -v1 -c praxis 192.168.100.173 1.3.6.1.2.1.43.11.1.1.8.0
SNMPv2-SMI::mib-2.43.11.1.1.8.0.1 = INTEGER: 674
SNMPv2-SMI::mib-2.43.11.1.1.8.0.2 = INTEGER: 240
SNMPv2-SMI::mib-2.43.11.1.1.8.0.3 = INTEGER: 226
SNMPv2-SMI::mib-2.43.11.1.1.8.0.4 = INTEGER: 241
On my Linux box I use the following script to do some pretty-printing:
#!/bin/sh
PATH=/opt/bin${PATH:+:$PATH}
# get current ink levels
eval $(snmpwalk -v1 -c praxis 192.168.100.173 1.3.6.1.2.1.43.11.1.1.6.0 |
perl -ne 'print "c[$1]=$2\n" if(m!SNMPv2-SMI::mib-2.43.11.1.1.6.0.(\d) = STRING:\s+"(\w+) ink"!i);')
# get max ink level per cartridge
eval $(snmpwalk -v1 -c praxis 192.168.100.173 1.3.6.1.2.1.43.11.1.1.8.0 |
perl -ne 'print "max[$1]=$2\n" if(m!SNMPv2-SMI::mib-2.43.11.1.1.8.0.(\d) = INTEGER:\s+(\d+)!i);')
snmpwalk -v1 -c praxis 192.168.100.173 1.3.6.1.2.1.43.11.1.1.9.0 |
perl -ne '
my #c=("","'${c[1]}'","'${c[2]}'","'${c[3]}'","'${c[4]}'");
my #max=("","'${max[1]}'","'${max[2]}'","'${max[3]}'","'${max[4]}'");
printf"# $c[$1]=$2 (%.0f)\n",$2/$max[$1]*100
if(m!SNMPv2-SMI::mib-2.43.11.1.1.9.0.(\d) = INTEGER:\s+(\d+)!i);'
An alternative approach could be using ipp. While most of the printers I tried support both, I found one which only worked with ipp and one that only worked for me with snmp.
Simple approach with ipptool:
Create file colors.ipp:
{
VERSION 2.0
OPERATION Get-Printer-Attributes
GROUP operation-attributes-tag
ATTR charset "attributes-charset" "utf-8"
ATTR naturalLanguage "attributes-natural-language" "en"
ATTR uri "printer-uri" $uri
ATTR name "requesting-user-name" "John Doe"
ATTR keyword "requested-attributes" "marker-colors","marker-high-levels","marker-levels","marker-low-levels","marker-names","marker-types"
}
Run:
ipptool -v -t ipp://192.168.2.126/ipp/print colors.ipp
The response:
"colors.ipp":
Get-Printer-Attributes:
attributes-charset (charset) = utf-8
attributes-natural-language (naturalLanguage) = en
printer-uri (uri) = ipp://192.168.2.126/ipp/print
requesting-user-name (nameWithoutLanguage) = John Doe
requested-attributes (1setOf keyword) = marker-colors,marker-high-levels,marker-levels,marker-low-levels,marker-names,marker-types
colors [PASS]
RECEIVED: 507 bytes in response
status-code = successful-ok (successful-ok)
attributes-charset (charset) = utf-8
attributes-natural-language (naturalLanguage) = en-us
marker-colors (1setOf nameWithoutLanguage) = #00FFFF,#FF00FF,#FFFF00,#000000,none
marker-high-levels (1setOf integer) = 100,100,100,100,100
marker-levels (1setOf integer) = 6,6,6,6,100
marker-low-levels (1setOf integer) = 5,5,5,5,5
marker-names (1setOf nameWithoutLanguage) = Cyan Toner,Magenta Toner,Yellow Toner,Black Toner,Waste Toner Box
marker-types (1setOf keyword) = toner,toner,toner,toner,waste-toner
marker-levels has current toner/ink levels, marker-high-levels are maximus (so far I've only seen 100s here), marker-names describe meaning of each field (tip: for colors you may want to strip everything after first space, many printers include cartridge types in this field).
Note: the above is with cups 2.3.1. With 2.2.1 I had to specify the keywords as one string instead ("marker-colors,marker-h....). Or it can be left altogether, then all keywords are returned.
More on available attributes (may differ between printers): https://www.cups.org/doc/spec-ipp.html
More on executing ipp calls (including python examples): https://www.pwg.org/ipp/ippguide.html
I really liked tseeling's approach!
Complementarily, I found out that the max value for the OID ... .9 is not 255 as guessed by him, but it actually varies per individual cartridge. The values can be obtained from OID .1.3.6.1.2.1.43.11.1.1.8 (the results obtained by dividing by these values match the ones obtained by running hp-inklevels command from hplip.
I wrote my own script that output CSVs like below (suppose printer IP addr is 192.168.1.20):
# ./hpink 192.168.1.20
black,73,366,19.9454
yellow,107,115,93.0435
cyan,100,108,92.5926
magenta,106,114,92.9825
values are in this order: <color_name>,<level>,<maxlevel>,<percentage>
The script source (one will notice I usually prefer awk over perl when the puzzle is simple enough):
#!/bin/sh
snmpwalk -v1 -c public $1 1.3.6.1.2.1.43.11.1.1 | awk '
/.*\.6\.0\./ {
sub(/.*\./,"");
split($0,TT,/[ "]*/);
color[TT[1]]=TT[4];
}
/.*\.8\.0\./ {
sub(/.*\./,"");
split($0,TT,/[ "]*/);
maxlevel[TT[1]]=TT[4];
}
/.*\.9\.0\./ {
sub(/.*\./,"");
split($0,TT,/[ "]*/);
print color[TT[1]] "," TT[4] "," maxlevel[TT[1]] "," TT[4] / maxlevel[TT[1]] * 100;
}'
I have tried to work with giza++ on window (using Cygwin compiler).
I used this code:
//Suppose source language is French and target language is English
plain2snt.out FrenchCorpus.f EnglishCorpus.e
mkcls -c30 -n20 -pFrenchCorpus.f -VFrenchCorpus.f.vcb.classes opt
mkcls -c30 -n20 -pEnglishCorpus.e -VEnglishCorpus.e.vcb.classes opt
snt2cooc.out FrenchCorpus.f.vcb EnglishCorpus.e.vcb FrenchCorpus.f_EnglishCorpus.e.snt >courpuscooc.cooc
GIZA++ -S FrenchCorpus.f.vcb -T EnglishCorpus.e.vcb -C FrenchCorpus.f_EnglishCorpus.e.snt -m1 100 -m2 30 -mh 30 -m3 30 -m4 30 -m5 30 -p1 o.95 -CoocurrenceFile courpuscooc.cooc -o dictionary
But the after getting the output files from giza++ and evaluate output, I observed that the results were too bad.
My evaluation result was:
RECALL = 0.0889
PRECISION = 0.0990
F_MEASURE = 0.0937
AER = 0.9035
Dose any body know the reason? Could the reason be that I have forgotten some parameters or I should change some of them?
in other word:
first I wanted train giza++ by huge amount of data and then test it by small corpus and compare its result by desired alignment(GOLD STANDARD) , but I don't find any document or useful page in web.
can you introduce useful document?
Therefore I ran it by small courpus (447 sentence) and compared result by desired alignment.do you think this is right way?
Also I changed my code as follows and got better result but It's still not good:
GIZA++ -S testlowsf.f.vcb -T testlowde.e.vcb -C testlowsf.f_testlowde.e.snt -m1 5 -m2 0 -mh 5 -m3 5 -m4 0 -CoocurrenceFile inputcooc.cooc -o dictionary -model1dumpfrequency 1 -model4smoothfactor 0.4 -nodumps 0 -nsmooth 4 -onlyaldumps 1 -p0 0.999 -diagonal yes -final yes
result of evaluation :
// suppose A is result of GIZA++ and G is Gold standard. As and Gs is S link in A And G files. Ap and Gp is p link in A and G files.
RECALL = As intersect Gs/Gs = 0.6295
PRECISION = Ap intersect Gp/A = 0.1090
FMEASURE = (2*PRECISION*RECALL)/(RECALL + PRECISION) = 0.1859
AER = 1 - ((As intersect Gs + Ap intersect Gp)/(A + S)) = 0.7425
Do you know the reason?
Where did you get those parameters? 100 iterations of model1?! Well, if you actually manage to run this, I strongly suspect that you have a very small parallel corpus. If so, you should consider adding more parallel data in training. And how exactly do you calculate the recall and precision measures?
EDIT:
With less than 500 sentences you're unlikely to get any reasonable performance. The usual way to do it is not find a larger (unaligned) parallel corpus, run GIZA++ on both together and then evaluate the small part for which you have the manual alignments. Check Europarl or MultiUN, these are freely available corpora, both contain a relatively large amount of English-French parallel data. The instructions on preparing the data can be found on the websites.