Mahout top score words and false positives

I've set up Mahout to classify news articles so that I can extract only those articles which are of interest.
I've gone through and manually trained on the titles of these news articles, approximately 80,000 of them (both articles I want and articles I don't).
I have written an app which outputs the top words and their scores, and certain keywords are creeping high up the top-word lists.
Some of these so-called top words are false positives: they are only at the top because every title page contains them,
such as 'stratford herald' (which is the name of the newspaper). Is there any way to remove them once a model has already been created?
There are about 20 top words I would simply like to get rid of (or get Mahout to ignore when providing best labels), but I don't want this to be an exercise on the input side (i.e. filtering the names I'd like to exclude from the training input); I'd prefer to remove them after the fact, as I've already spent a lot of time manually training.
home: 1067
dorset: 1493
details: 908
back: 867
poole: 1651
set: 819
help: 743
get: 812
bournemouth: 14728
new: 2661
avon: 2684
local: 3092
cherries: 1244
police: 1011
over: 1813
echo: 6526
null: 79983
after: 2292
stratford: 2657
school: 1395
jobs: 881
job: 6982
car: 772
herald: 2817
nurse: 1174
man: 1335
manager: 1071
day: 759
time: 764
council: 824
upon: 2676
Number of labels: 2
Number of documents in training set: 79983
Top 75 words for label negative_article
stratford: 10748.598348617554
herald: 7579.555884361267
avon: 7484.692479610443
upon: 7476.3635239601135
local: 7426.4039397239685
after: 3837.6605548858643
man: 3512.4373264312744
police: 2586.899124145508
over: 1537.557123184204
woman: 1434.1630334854126
Top 75 words for label other
bournemouth: 39076.86379265785
job: 24028.39960718155
echo: 22974.801107406616
new: 10888.526140213013
stratford: 8045.635549545288
poole: 7493.278381347656
over: 7077.8266887664795
school: 7011.863867282867
local: 7004.647378444672
dorset: 6961.040742397308

Related

missing data in time series

I'm quite new to this field. I'm trying to explore time-series data: find the missing values, count them, study the distribution of their lengths, and fill in the gaps. I have, let's say, 10 .txt files, and each file has 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say up to 100; the 10 files do not necessarily all have the same number of observations.
In this example the missing values (which do not necessarily appear in all of my files) are 4, 5 and 6 in C2, together with the corresponding entries of the first column C1 (measured in milliseconds, so the value of 928 ms is not a time neighbor of 912 ms). I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should.
path <- "files path"
# list every .txt file in the folder; full.names = TRUE so read.table can find them
file.names <- dir(path, pattern = "\\.txt$", full.names = TRUE)
out.file <- NULL   # start empty instead of with a dummy row of zeros
for (i in 1:length(file.names)) {
  file <- cbind(read.table(file.names[i],
                           header = FALSE,
                           sep = "\t",
                           stringsAsFactors = FALSE),
                basename(file.names[i]))
  colnames(file) <- c('TS', 'Index', 'File')
  out.file <- rbind(out.file, file)
}
# total missing values: a jump of k in Index means k - 1 missing rows; compare rows
# only within the same file so the jump between two files is not counted as a gap
d <- dim(out.file)[1]
misDa <- 0
for (i in 1:(d - 1)) {
  if (out.file$File[i] == out.file$File[i + 1] &&
      out.file$Index[i + 1] - out.file$Index[i] > 1)
    misDa <- misDa + (out.file$Index[i + 1] - out.file$Index[i] - 1)
}
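For the histogram of gap lengths, a short follow-on sketch building on the out.file data frame from the loop above (the diff()-based approach is just one way to do it):
gap.lengths <- c()
for (f in unique(out.file$File)) {
  jumps <- diff(out.file$Index[out.file$File == f])    # differences between consecutive Index values
  gap.lengths <- c(gap.lengths, jumps[jumps > 1] - 1)  # a jump of k means a gap of k - 1 missing values
}
sum(gap.lengths)   # should equal misDa, the total number of missing values
hist(gap.lengths, breaks = seq(0.5, max(gap.lengths) + 0.5, by = 1),
     main = "Gap lengths across all files", xlab = "gap length")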
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for missing-data visualization.
The naniar package is especially good for multivariate data, and the imputeTS package is especially good for time-series data; both come with several plot functions for missing values.
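A minimal sketch of the kind of calls meant here, assuming a data frame df with NA values and a single numeric series x in which the gaps appear as explicit NAs (the object names are mine; the function names are from recent naniar / imputeTS releases):
library(naniar)      # multivariate missing-data visualization
vis_miss(df)         # heat-map overview of which cells are NA
gg_miss_var(df)      # number of NAs per variable
library(imputeTS)            # univariate time series with NAs
statsNA(x)                   # printed summary: NA count, longest gap, ...
ggplot_na_distribution(x)    # where in the series the NAs sit
ggplot_na_gapsize(x)         # how often gaps of each length occur
x_filled <- na_interpolation(x)   # one simple way to fill the gaps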

Estimating mortality with acmeR package

There is a relatively new package that has come out called acmeR for producing estimates of mortality (for birds and bats), and it takes into consideration things like bleedthrough (was the carcass still there but undetected, and then found in a later search?), diminishing searcher efficiency, etc. This is extremely useful, except I can't seem to get it to work, despite it seeming to be pretty straightforward. The data structure should be like:
Date, in US format mm/dd/yyyy or ISO 8601 format yyyy-mm-dd
Time, in am/pm US format HH:MM:SS AM or 24-hr ISO format HH:MM:SS
ID, arbitrary distinct alpha strings unique to each carcass
Species, arbitrary distinct alpha strings (e.g. AOU, ABMP, IBP)
Event, “Place”, “Check”, or “Search” (only 1st letter counts)
Found, TRUE or FALSE (only 1st letter counts)
and look something like this:
“Date”,“Time”,“ID”,“Species”,“Event”,“Found”
“1/7/2011”,“08:00:00 PM”,“T091”,“UNBA”,“Place”,TRUE
“1/8/2011”,“12:00:00 PM”,“T091”,“UNBA”,“Check”,TRUE
“1/8/2011”,“16:00:00 PM”,“T091”,“UNBA”,“Search”,FALSE
My data look like this:
Date: Date, format: "2017-11-09" "2017-11-09" "2017-11-09" ...
Time: Factor w/ 644 levels "1:00 PM","1:01 PM",..: 467 491 518 89 164 176 232 39 53 247 ...
Species: Factor w/ 52 levels "AMCR","AMKE",..: 31 27 39 27 39 31 39 45 27 24 ...
ID: Factor w/ 199 levels "GHBT001","GHBT002",..: 1 3 2 3 2 1 2 7 3 5 ...
Event: Factor w/ 3 levels "Check","Place",..: 2 2 2 3 3 3 1 2 1 2 ...
Found: logi TRUE TRUE TRUE FALSE TRUE TRUE ...
I have played with the date, time, event, etc. formats too, trying multiple formats because I have had some issues there. Here are some of the errors I have encountered, depending on which subset of data I use:
Error in optim(p0, f, rd = rd, method = "BFGS", hessian = TRUE) : non-finite value supplied by optim
In addition: Warning message: In log(c(a0, b0, t0)) : NaNs produced
Error in read.data(fname, spec = spec, blind = blind) : Expecting date format YYYY-MM-DD (ISO) or MM/DD/YYYY (USA) USA
Error in solve.default(rv$hessian): system is computationally singular: reciprocal condition number = 1.57221e-20
Warning message: In sqrt(diag(Sig)[2]) : NaNs produced
Error in solve.default(rv$hessian) : Lapack routine dgesv: system is exactly singular: U[2,2] = 0
The last error is the most common (and note that my data are non-numeric, so I am not sure what math is being done behind the scenes to come up with these equations and then fail in the solve), but the others are persistent too. Sometimes, despite the formatting being exactly the same, a subset of the data will return an error when the parent dataset does not (as far as I can tell this has nothing to do with a species being present in one dataset and not the other).
I cannot find any bug reports or issues with the acmeR package out there, so maybe it runs perfectly and my data are the problem; but after three ecologists have vetted the data and pronounced it good, I am at a dead end.
Here is a subset of the data, so you can see what they look like:
Date Time Species ID Event Found
8 2017-11-09 1:39 PM VATH GHBT007 P T
11 2017-11-09 2:26 PM CORA GHBT004 P T
12 2017-11-09 2:30 PM EUST GHBT006 P T
14 2017-11-09 6:43 AM CORA GHBT004 S T
18 2017-11-09 8:30 AM EUST GHBT006 S T
19 2017-11-09 9:40 AM CORA GHBT004 C T
20 2017-11-09 10:38 AM EUST GHBT006 C T
22 2017-11-09 11:27 AM VATH GHBT007 S F
32 2017-11-09 10:19 AM EUST GHBT006 C F
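Not a full answer, but since several of the errors point at parsing or degenerate inputs, one thing worth ruling out is the column formats themselves. A minimal base-R sketch (the data frame name dat and the output file name are made up for illustration) that coerces the columns shown in the str() output above into exactly the documented formats before the file is handed to acmeR:
dat$Date    <- format(as.Date(dat$Date), "%Y-%m-%d")              # ISO dates as text
dat$Time    <- format(strptime(as.character(dat$Time),            # "1:39 PM" -> "13:39:00";
                               "%I:%M %p"), "%H:%M:%S")           # %p assumes an English AM/PM locale
dat$ID      <- as.character(dat$ID)                               # plain strings, not factors
dat$Species <- as.character(dat$Species)
dat$Event   <- substr(as.character(dat$Event), 1, 1)              # "P", "C" or "S" (only 1st letter counts)
write.csv(dat, "carcass_searches.csv", row.names = FALSE)         # hypothetical file to feed to acmeR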

Parsing complex files with Parsec

I would like to parse files containing several sequences of data (same number of columns, same content, ...) with Haskell.
My data sequences will be delimited by keywords before and after.
BEGIN
1 882
2 809
3 435
4 197
5 229
6 425
...
END
BEGIN
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
5 246 344 501
6 929 138 474
...
END
My problem is that after several tests with Parsec, I have the impression that Parsec is made to parse a file line by line rather than as a whole.
Is Parsec the right way to do what I want, or should I consider another tool like Happy or Alex?
Is there a website (or other resource) providing examples of parsing complex text files with Parsec?
Note: the example I give is a very simple one; things would be trickier in my real files, with many more keywords and combinations.
The format as you've described it wouldn't be hard at all to handle in parsec.
As for learning how to use it: your first step should be to avoid whatever guide gave you the impression that parsec worked line-by-line. I recommend Chapter 16 of Real World Haskell as a good place to get started, and once you're comfortable with the basics the reference material at http://hackage.haskell.org/package/parsec is actually very clear.

collaborative filtering item-based in mahout - without isolating users

In Mahout there is an implemented method for item-based collaborative filtering called itemsimilarity.
In theory, the similarity between two items should be calculated using only the users who rated both items. During testing I realized that Mahout works differently.
In the example below, the similarity between items 11 and 12 should be equal to 1, but Mahout's output is 0.36.
Example 1. items are 11-12
Similarity between items:
101 102 0.36602540378443865
Matrix with preferences:
11 12
1 1
2 1
3 1 1
4 1
It looks like Mahout treats null as 0.
Example 2. items are 101-103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences:
101 102 103
1 1 0.1
2 1 0.1
3 1 0.1
4 1 1 0.1
5 1 1 0.1
6 1 0.1
7 1 0.1
8 1 0.1
9 1 0.1
10 1 0.1
The similarity between items 101 and 102 should be calculated using only the ratings of users 4 and 5, and the same goes for items 101 and 103 (at least according to the theory). Here (101, 103) comes out as more similar than (101, 102), and it shouldn't.
Both examples were run without any additional parameters.
Is this problem solved somewhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf
Those users are not identical. Collaborative filtering needs a measure of cooccurrence, and the same items do not cooccur between those users. Likewise, the items are not identical; they each have different users who preferred them.
The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as a value of 0; this is expected and correct. The algorithms treat 0 as no preference, not as a negative preference.
It's doing the right thing.
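As a concrete check of the "missing means 0" point: the reported 0.36602540378443865 is exactly 1 / (1 + sqrt(3)), which is what a 1/(1 + Euclidean distance) similarity gives when the two item vectors are taken over all four users with the unrated cells filled in as 0. The metric is inferred from the numbers here, not from Mahout's documentation, and exactly which of users 1, 2 and 4 rated which item does not change the distance:
# one consistent reading of the Example 1 preference matrix, missing ratings as 0
item11 <- c(1, 0, 1, 1)               # users 1..4
item12 <- c(0, 1, 1, 0)
d <- sqrt(sum((item11 - item12)^2))   # Euclidean distance = sqrt(3)
1 / (1 + d)                           # 0.3660254..., the value Mahout printed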

Logical Addresses & Page numbers

I just started learning memory management and have an idea of pages, frames, virtual memory and so on, but I don't understand the procedure for converting logical addresses to their corresponding page numbers.
Here is the scenario:
Page size = 100 words / 8000 bits(?)
The process generates these logical addresses:
10 11 104 170 73 309 185 245 246 434 458 364
The process takes up two page frames, and none of its pages are resident (in page frames) when the process begins execution.
Determine the page number corresponding to each logical address and fill them into a table with one row and 12 columns.
I know the answer is :
0 0 1 1 0 3 1 2 2 4 4 3
Can someone explain how this is done? Is there an equation or something? I remember seeing something with a table, converting things to binary and putting them in the page table, like 00100 in page 1, but I am not really sure. A graphical representation of how this works would be more than appreciated. Thanks.
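For what it's worth, the quoted answer is consistent with plain integer division: with a page size of 100 words, page number = floor(logical address / 100), and the offset within the page is the remainder. A quick check of the arithmetic (in R, purely for illustration):
addresses <- c(10, 11, 104, 170, 73, 309, 185, 245, 246, 434, 458, 364)
addresses %/% 100   # page numbers: 0 0 1 1 0 3 1 2 2 4 4 3
addresses %%  100   # offsets within each page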
