As im so new to this field and im trying to explore the data for a time series, and find the missing values and count them and study a distribution of their length and fill in these gaps, the thing is i have, let's say 10 file.txt and for each file i have 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on... lets say till 100 and not necessarily the 10 files have the same number of observations.
so here for example the missing values and not necessarily appears in all files that i have, missing value are: 4,5 and 6 in C2 and the corresponding 1st column C1(measured in milliseconds, so the value of 928ms is not a time neighbor of 912ms). So i want to find those gaps(the total missing values in all 10 files) and show a histogram of their lengths.
i wrote a piece of code in R, but the problem is that i don't get the exact total number that i should have for the missing values.
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- cbind(read.table(file.names[i],
header=F,
sep ="\t",
stringsAsFactors=FALSE),
file.names[i])
colnames(file) <- c('TS', 'Index', 'File')
out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
misDa = misDa+1
}
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (like it seems) the naniar and the imputeTS packages offer nice functions for missing data visualizations.
Some examples from the naniar package, which is especially good for multivariate data (more plot examples):
Some examples from the imputeTS package, which is especially good for time series data (additional plot examples):
There is a relatively new package that has come out called acmeR for producing estimates of mortality (for birds and bats), and it takes into consideration things like bleedthrough (was the carcass still there but undetected, and then found in a later search?), diminishing searcher efficiency, etc. This is extremely useful, except I can't seem to get it to work, despite it seeming to be pretty straightforward. The data structure should be like:
Date, in US format mm/dd/yyyy or ISO 8601 format yyyy-mm-dd
Time, in am/pm US format HH:MM:SS AM or 24-hr ISO format HH:MM:SS
ID, arbitrary distinct alpha strings unique to each carcas
Species, arbitrary distinct alpha strings (e.g. AOU, ABMP, IBP)
Event, “Place”, “Check”, or “Search” (only 1st letter counts)
Found, TRUE or FALSE (only 1st letter counts)
and look something like this:
“Date”,“Time”,“ID”,“Species”,“Event”,“Found”
“1/7/2011”,“08:00:00 PM”,“T091”,“UNBA”,“Place”,TRUE
“1/8/2011”,“12:00:00 PM”,“T091”,“UNBA”,“Check”,TRUE
“1/8/2011”,“16:00:00 PM”,“T091”,“UNBA”,“Search”,FALSE
My data look like this:
Date: Date, format: "2017-11-09" "2017-11-09" "2017-11-09" ...
Time: Factor w/ 644 levels "1:00 PM","1:01 PM",..: 467 491 518 89 164 176 232 39 53 247 ...
Species: Factor w/ 52 levels "AMCR","AMKE",..: 31 27 39 27 39 31 39 45 27 24 ...
ID: Factor w/ 199 levels "GHBT001","GHBT002",..: 1 3 2 3 2 1 2 7 3 5 ...
Event: Factor w/ 3 levels "Check","Place",..: 2 2 2 3 3 3 1 2 1 2 ...
Found: logi TRUE TRUE TRUE FALSE TRUE TRUE ...
I have played with the date, time, event, etc formats too, trying multiple formats because I have had some issues there. So here are some of the errors I have encountered, depending on what subset of data I use:
Error in optim(p0, f, rd = rd, method = "BFGS", hessian = TRUE) :non-finite value supplied by optim In addition: Warning message: In log(c(a0, b0, t0)) : NaNs produced
Error in read.data(fname, spec = spec, blind = blind) : Expecting date format YYYY-MM-DD (ISO) or MM/DD/YYYY (USA) USA
Error in solve.default(rv$hessian): system is computationally singular: reciprocal condition number = 1.57221e-20
Warning message: # In sqrt(diag(Sig)[2]) : NaNs produced
Error in solve.default(rv$hessian) : Lapack routine dgesv: system is exactly singular: U[2,2] = 0
The last error is most common (and note, my data are non-numeric, sooooo... I am not sure what math is being done behind the scenes to come up with these equations, then fail in the solve), but the others are persistent too. Sometimes, despite the formatting being exactly the same in one dataset, a subset of that data will return an error when the parent dataset does not (does not have anything to do with species being there/not being there in one or the other dataset, as far as I can tell).
I cannot find any bug reports or issues with the acmeR package out there - so maybe it runs perfectly and my data are the problem, but after three ecologists have vetted the data and pronounced it good, I am at a dead end.
Here is a subset of the data, so you can see what they look like:
Date Time Species ID Event Found
8 2017-11-09 1:39 PM VATH GHBT007 P T
11 2017-11-09 2:26 PM CORA GHBT004 P T
12 2017-11-09 2:30 PM EUST GHBT006 P T
14 2017-11-09 6:43 AM CORA GHBT004 S T
18 2017-11-09 8:30 AM EUST GHBT006 S T
19 2017-11-09 9:40 AM CORA GHBT004 C T
20 2017-11-09 10:38 AM EUST GHBT006 C T
22 2017-11-09 11:27 AM VATH GHBT007 S F
32 2017-11-09 10:19 AM EUST GHBT006 C F
I would like to parse files with several sequences of data (same number of column, same content, ...) with Haskell.
My data sequences will be delimited by keywords before and after.
BEGIN
1 882
2 809
3 435
4 197
5 229
6 425
...
END
BEGIN
1 235 623 684
2 871 699 557
3 918 686 49
4 53 564 906
5 246 344 501
6 929 138 474
...
END
My problem is that after several tests with Parsec, I have the impression that Parsec is rather made to parse a file line by line and not the whole file.
Is Parsec the right way to make what I want or should I consider an other tool like Happy or Alex ?
Is there a website (or other ressource) providing examples of parsing complex text files with Parsec ?
Note : The example I give is a very simple one. Things would be more tricky in my files with many more keywords and combinations.
The format as you've described wouldn't be hard at all to handle in parsec.
As for learning how to use it: your first step should be to avoid whatever guide gave you the impression that parsec worked line-by-line. I recommend Chapter 16 of Real World Haskell as a good place to get started, and once you're comfortable with the basics the reference material at http://hackage.haskell.org/package/parsec is actually very clear.
In mahout there is implemented method for item based Collaborative filtering called itemsimilarity.
In the theory, similarity between items should be calculated only for users who ranked both items. During testing I realized that in mahout it works different.
In below example the similarity between item 11 and 12 should be equal 1, but mahout output is 0.36.
Example 1. items are 11-12
Similarity between items:
101 102 0.36602540378443865
Matrix with preferences:
11 12
1 1
2 1
3 1 1
4 1
It looks like mahout treats null as 0.
Example 2. items are 101-103.
Similarity between items:
101 102 0.2612038749637414
101 103 0.4340578302732228
102 103 0.2600070276638468
Matrix with preferences:
101 102 103
1 1 0.1
2 1 0.1
3 1 0.1
4 1 1 0.1
5 1 1 0.1
6 1 0.1
7 1 0.1
8 1 0.1
9 1 0.1
10 1 0.1
Similarity between items 101 and 102 should be calculated using only ranks for users 4 and 5, and the same for items 101 and 103 (that should be based on theory). Here (101,103) is more similar than (101,102), and it shouldn't be.
Both examples were run without any additional parameters.
Is this problem solved somwhere, somehow? Any ideas?
Source: http://files.grouplens.org/papers/www10_sarwar.pdf
Those users are not identical. Collaborative filtering needs to have a measure of cooccurrence and the same items do not cooccur between those users. Likewise the items are not identical, they each have different users who prefered them.
The data is turned into a "sparse matrix" where only non-zero values are recorded. The rest are treated as a 0 value, this is expected and correct. The algorithms treat 0 as no preference, not a negative preference.
It's doing the right thing.
I just started learning Memory Management and have an idea of page,frames,virtual memory and so on but I'm not understanding the procedure from changing logical addresses to their corresponding page numbers,
Here is the scenario-
Page Size = 100 words /8000 bits?
Process generates this logical address:
10 11 104 170 73 309 185 245 246 434 458 364
Process takes up two page frames,and that none of its are resident (in page frames) when the process begins execution.
Determine the page number corresponding to each logical address and fill them into a table with one row and 12 columns.
I know the answer is :
0 0 1 1 0 3 1 2 2 4 4 3
But can someone explain how this is done? Is there a equation or something? I remember seeing something with a table and changing things to binary and putting them in the page table like 00100 in Page 1 but I am not really sure. Graphical representations of how this works would be more than appreciated. Thanks