I'm new to this field, and I'm trying to explore time series data: find the missing values, count them, study the distribution of their lengths, and fill in the gaps. The thing is, I have, let's say, 10 .txt files, and each file has 2 columns as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say up to 100; the 10 files do not necessarily have the same number of observations.
In this example the missing values (which do not necessarily appear in every file) are 4, 5 and 6 in C2, along with the corresponding values in the 1st column C1 (measured in milliseconds, so the 928 ms value is not a time neighbor of 912 ms). I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should.
path <- "files path"
out.file <- data.frame()                 # no dummy first row
file.names <- dir(path, pattern = "\\.txt$")
for (i in seq_along(file.names)) {
  file <- read.table(file.path(path, file.names[i]),
                     header = FALSE,
                     sep = "\t",
                     stringsAsFactors = FALSE)
  colnames(file) <- c('TS', 'Index')
  file$File <- file.names[i]
  out.file <- rbind(out.file, file)
}
d <- nrow(out.file)
misDa <- 0                               # total number of missing values
gap.lengths <- c()                       # one entry per gap, for the histogram
for (i in 1:(d - 1)) {
  jump <- out.file$Index[i + 1] - out.file$Index[i]
  # only compare rows of the same file; a jump of k+1 means k missing values
  if (out.file$File[i] == out.file$File[i + 1] && jump > 1) {
    misDa <- misDa + (jump - 1)
    gap.lengths <- c(gap.lengths, jump - 1)
  }
}
hist(gap.lengths, main = "Gap lengths", xlab = "gap length")
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (as it seems), the naniar and imputeTS packages offer nice functions for missing-data visualization.
The naniar package is especially good for multivariate data; for example, vis_miss() and gg_miss_var() give quick overviews of where values are missing.
The imputeTS package is especially good for time series data; for example, ggplot_na_distribution() plots where the NAs sit in a series and statsNA() prints summary statistics about them.
Imagine we have a dataset in Google Sheets representing a grade book. Columns E, G, I, K, and M hold the scores for questions 1 to 5, and rows 5 to 64 hold the students. I want to see how many students have solved at least 3 questions, where "solving" means getting the full mark on that question (the grade distribution can vary; for example, question 1 is worth 10 points while the others are worth 25).
One thing that popped into my mind was to create a new column, store the number of solved questions for each student there, and then iterate over it to see how many are >= 3. Is there a way to solve the problem without creating or using new rows/columns?
I didn't find anything suitable that deals with rows while also keeping track of the cell count in those rows. One approach is to use the inclusion–exclusion principle with this link here. It'd basically be something like
COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4)
+ COUNTIFS(E5:E64,E4,G5:G64,G4,K5:K64,K4)
+ COUNTIFS(E5:E64,E4,G5:G64,G4,M5:M64,M4)
+ COUNTIFS(E5:E64,E4,I5:I64,I4,K5:K64,K4)
+ COUNTIFS(E5:E64,E4,I5:I64,I4,M5:M64,M4)
+ COUNTIFS(E5:E64,E4,K5:K64,K4,M5:M64,M4)
+ COUNTIFS(G5:G64,G4,I5:I64,I4,K5:K64,K4)
+ COUNTIFS(G5:G64,G4,I5:I64,I4,M5:M64,M4)
+ COUNTIFS(G5:G64,G4,K5:K64,K4,M5:M64,M4)
+ COUNTIFS(I5:I64,I4,K5:K64,K4,M5:M64,M4)
- (COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,K5:K64,K4)
 + COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,M5:M64,M4)
 + COUNTIFS(E5:E64,E4,G5:G64,G4,K5:K64,K4,M5:M64,M4)
 + COUNTIFS(E5:E64,E4,I5:I64,I4,K5:K64,K4,M5:M64,M4)
 + COUNTIFS(G5:G64,G4,I5:I64,I4,K5:K64,K4,M5:M64,M4)
 - COUNTIFS(E5:E64,E4,G5:G64,G4,I5:I64,I4,K5:K64,K4,M5:M64,M4))
This link was the closest I got.
I think using matrices and multiplying them could be the solution. However, I'm not very good at that!
I'd appreciate any help.
Thanks in advance.
Update: here is a table to better illustrate the problem. The formula should return 2 (both w and z qualify).
Student Name  Question 1  Question 2  Question 3  Question 4  Question 5
Mr. x         10          14          17          8           25
Mr. y         8           25          25          14          19
Mr. w         10          25          17          8           25
Mr. z         10          14          25          25          25
This should cover it:
=SUMPRODUCT(--(((E5:E64=E4)+(G5:G64=G4)+(I5:I64=I4)+(K5:K64=K4)+(M5:M64=M4))>=3))
Each comparison against row 4 (the full marks) yields a TRUE/FALSE array over the students; adding the five arrays counts how many questions each student solved, and SUMPRODUCT totals the rows where that count is at least 3.
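As a sanity check outside Sheets, the same "at least 3 full marks" count can be reproduced in Python on the update table (a sketch; the variable names are mine, and the full marks follow the question: 10 for question 1, 25 for the rest):

import sys

full_marks = [10, 25, 25, 25, 25]            # row 4: full mark per question
scores = {
    "Mr. x": [10, 14, 17, 8, 25],
    "Mr. y": [8, 25, 25, 14, 19],
    "Mr. w": [10, 25, 17, 8, 25],
    "Mr. z": [10, 14, 25, 25, 25],
}

# count students with at least 3 scores equal to the full mark
count = sum(
    1 for s in scores.values()
    if sum(a == b for a, b in zip(s, full_marks)) >= 3
)
print(count)  # 2 (Mr. w and Mr. z)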
I've got hiking distance data from a start point in column A and a column with a yes/no condition (let's say a "Y" denotes a campsite, for example).
What I'm trying to achieve is to calculate the distance between each distance marker in column A that has the condition "Y" in column B. (Desired output is column C.)
A B C
--------------
0 Y
12
26 Y 26 (26 - 0 = 26)
57
124 Y 98 (124 - 26 = 98)
137
152 Y 28 (152 - 124 = 28)
169
. . .
. . .
. . .
I can pull out the distance from column A with a simple IF statement, but that doesn't get me anywhere, of course.
I've searched the Internet extensively and there are a ton of threads out there about finding the last value or last non-empty value in a column.
So I've tried to use INDEX, FILTER, and LOOKUP in all sorts of combinations, but sadly nothing produces the result I'm looking for.
The tricky part, I guess, is to find the last value with a Y above the "current" Y (if that makes any sense).
In C2 try
=ArrayFormula(if(B2:B="y", A2:A-iferror(vlookup(row(A2:A)-1, filter({row(A2:A), A2:A}, len(B2:B)),2)),))
and see if that works? The filter() keeps only the rows where column B is non-empty, and the sorted-mode vlookup() then finds the closest such row above the current one, so each "Y" row subtracts the distance of the previous "Y" (the iferror() handles the first "Y", which has nothing above it).
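The logic the formula implements is simply "current distance minus the distance of the previous flagged row"; here is the same idea spelled out in Python on the sample data above (a sketch; the list names are mine):

distances = [0, 12, 26, 57, 124, 137, 152, 169]
campsites = ["Y", "", "Y", "", "Y", "", "Y", ""]

prev = None
for d, flag in zip(distances, campsites):
    if flag == "Y":
        if prev is not None:
            print(d, d - prev)   # 26 26, then 124 98, then 152 28
        prev = d                 # remember the last campsite distance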
Just looking for some brief advice to put me back on the right track. I have been working on a solution to a problem where I have a very sparse input matrix (~25% of the cells filled, the rest are 0s) stored in a sparse.coo_matrix:
sparse_matrix = sparse.coo_matrix((value, (rater, blurb))).toarray()
After some work on building this array from my data set and messing around with some other options, I currently have my NMF model fitter function defined as follows:
from sklearn.decomposition import NMF
import numpy as np

def nmf_model(matrix):
    model = NMF(init='nndsvd', random_state=0)
    W = model.fit_transform(matrix)   # samples x components
    H = model.components_             # components x features
    return np.dot(W, H)               # reconstructed (dense) matrix
Now, the issue is that my output doesn't seem to account for the 0 values correctly. Any value that was a 0 gets bumped to some value below 1, and my known values fluctuate from the actual quite a bit (all data are ratings between 1 and 10). Can anyone spot what I am doing wrong? From the scikit-learn documentation, I assumed the nndsvd initialization would help account for the empty values correctly. Sample output:
#Row / Column / New Value
35 18 6.50746917334 #Actual Value is 6
35 19 0.580996641675 #Here down are all "estimates" of my function
35 20 1.26498699492
35 21 0.00194119935464
35 22 0.559623469753
35 23 0.109736902936
35 24 0.181657421405
35 25 0.0137801897011
35 26 0.251979684515
35 27 0.613055371646
35 28 6.17494590041 #Actual value is 5.5
Appreciate any advice any more experienced ML coders can offer!
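For what it's worth, scikit-learn's NMF treats explicit zeros as observed values of 0, not as missing entries, so the factorization is pulled toward reconstructing those zeros; nndsvd only changes the initialization. A common workaround is to fit only the observed cells, e.g. with mask-weighted multiplicative updates. A minimal NumPy sketch (the function and all names are mine, assuming zeros mark unobserved ratings):

import numpy as np

def masked_nmf(X, n_components=10, n_iter=200, eps=1e-9):
    # factorize X ~ W @ H using only the observed (non-zero) entries
    M = (X > 0).astype(float)              # 1 where a rating is observed
    rng = np.random.default_rng(0)
    n, m = X.shape
    W = rng.random((n, n_components))
    H = rng.random((n_components, m))
    for _ in range(n_iter):
        # standard multiplicative updates, restricted to observed cells by M
        W *= ((M * X) @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ (M * X)) / (W.T @ (M * (W @ H)) + eps)
    return W @ H                           # W @ H fills in the masked cells

The observed entries are then fit directly, while the reconstruction W @ H supplies estimates for the empty cells instead of being dragged toward 0.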
I have just entered the space of data mining, machine learning and clustering. I have a specific problem and do not know which technique to use to solve it.
I want to perform clustering of observations (objects, or whatever) with a specific data format. All variables in each observation are numeric. My data input looks like this:
1 2 3 4 5 6
1 3 5 7
2 9 10 11 12 13 14
45 1 22 23 24
Let's say that n represents a row (an observation, or 1D vector) and m represents a column (the variable index within each vector). n could be a very large number, and 0 < m < 100. An important point is that a single observation (row) cannot contain duplicate values (within the 1st row, for example, a value can appear only once).
So I want to perform clustering where observations are put in the same cluster based on the number of values they have in common.
If there are two rows like:
1
1 2 3 4 5
they should be placed in the same cluster; if there is no match, then definitely not. Also, the number of rows in one cluster should not exceed 100.
A tough problem? Just for the record, I haven't mentioned the time dimension, but let's skip that for now.
So, any directions from you guys?
Thanks and best regards,
JDK
It's hard to recommend anything since your problem is stated quite vaguely and we have no information on the data. Data mining (and in particular exploratory techniques like clustering) is all about understanding the data, so we cannot provide the ultimate answer.
Two things for you to consider:
1. if the data indicates presence of species or traits, Jaccard similarity (and other set-based metrics) is worth a try (see the sketch below).
2. if absence is less informative, maybe you should be mining association rules, not clusters
Either way, without understanding your data these numbers are as good as random numbers. You can easily cluster random numbers, and spend weeks to get the best useless result!
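Since a value can appear at most once per row, each row is effectively a set, which is exactly what Jaccard similarity operates on. A minimal Python sketch (the function name is mine), using the two example rows from the question:

def jaccard(a, b):
    # |intersection| / |union| of the two rows, viewed as sets
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

print(jaccard([1], [1, 2, 3, 4, 5]))   # 0.2 -> overlap, candidates for one cluster
print(jaccard([1, 2, 3], [4, 5, 6]))   # 0.0 -> no match, never the same cluster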
Can your problem be treated as a Bag-of-words model, where each article (observation row) has no more than 100 terms?
Anyway, I think you have to give more information and examples about "why" and "how" you want to cluster these data. For example, say we have:
1 2 3
2 3 4
2 3 4 5
1 2 3 4
3 4 6
6 7 8
9 10
9 11
10 11 12 13 14
What is your expected clustering? How many clusters are there in this clustering? Only two clusters?
Before you give more information: based on your current description, I think you do not need a clustering algorithm but a connected-components structure. In the first round you process the dataset to build the connected components, and in a second round you check which connected component each row belongs to. Take the example above; first round:
1 2 3          : 1 <- 1, 1 <- 2, 1 <- 3  (every point is linked to the smallest
                 point in the row, to mark that it belongs to that point's cluster)
2 3 4          : 2 <- 4  (2 and 3 are already linked to 1, which is <= 2,
                 so they do not need to change)
2 3 4 5        : 2 <- 5
1 2 3 4        : 1 <- 4  (this change is not essential, since we already have
                 1 <- 2 <- 4, but making it speeds up the second round)
3 4 6          : 3 <- 6
6 7 8          : 6 <- 7, 6 <- 8
9 10           : 9 <- 9, 9 <- 10
9 11           : 9 <- 11
10 11 12 13 14 : 10 <- 12, 10 <- 13, 10 <- 14
Now we have a forest structure representing the connected components of the points. In the second round you can simply pick one point from each row (the smallest is best) and trace its root in the forest. Rows that share the same root are in the same, in your words, cluster. For example:
1 2 3 : 1 <- 1, cluster root 1
2 3 4 5 : 1 <- 1 <- 2, cluster root 1
6 7 8 : 1 <- 1 <- 3 <- 6, cluster root 1
9 10 : 9 <- 9, cluster root 9
10 11 12 13 14 : 9 <- 9 <- 10, cluster root 9
This process takes O(k) space, where k is the number of points, and O(nm + nh) time, where h is the height of the forest structure; h is typically very small.
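The two rounds map directly onto a union-find (disjoint-set) structure; here is a compact Python sketch of the idea (the helper names are mine, and the rows list is taken from the example above):

parent = {}

def find(x):
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]   # path halving keeps the trees shallow
        x = parent[x]
    return x

def union(a, b):
    ra, rb = find(a), find(b)
    parent[max(ra, rb)] = min(ra, rb)   # the smallest point stays the root

rows = [[1, 2, 3], [2, 3, 4], [2, 3, 4, 5], [1, 2, 3, 4], [3, 4, 6],
        [6, 7, 8], [9, 10], [9, 11], [10, 11, 12, 13, 14]]

# first round: link every point in a row to the row's smallest point
for row in rows:
    for p in row[1:]:
        union(row[0], p)

# second round: rows sharing a root belong to the same component
clusters = {}
for row in rows:
    clusters.setdefault(find(row[0]), []).append(row)

print(sorted(clusters))   # [1, 9] -> two components, as traced above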
I am not sure if this is the result you want.
I have a large text file as below imported in MATLAB:
Run Lat Long Time
1 32 32 34
1 23 22 21
2 23 12 11
2 11 11 11
2 33 11 12
up to 10 runs etc.
So I'm trying to break the file up by section (section 1, section 2, etc.) and write the sections to 10 different text files: file 1 will have the data from run 1, file 2 the data from run 2, and so on.
What you're looking for is MATLAB's textread function (in newer releases, textscan does the same job). I'll give you the pieces you need and frame out the logic, but you'll need to connect the pieces yourself :)
Your read would look something like this
[head1, head2, head3, head4] = textread(file_name, '%s %s %s %s', 1);
[run, lat, long, time] = textread(file_name, '%u %u %u %u', 'headerlines', 1);  % skip the header row
and your write method would use a loop to iterate over the values in
unique(run)
creating a file with
fout = fopen([base_file_name_out num2str(run_number)], 'w');  % 'w' = open for writing; fopen defaults to read-only
and writing to it the values contained in
lat_this_run = lat(run==run_number);
using the method
fprintf(fout, '%u %u %u\n', [lat_this_run, long_this_run, time_this_run]')
(note the transpose: fprintf consumes data column by column, so the matrix must be transposed to interleave the three vectors row by row), and finally closing each file with fclose(fout).
If your data is already loaded into MATLAB and named A, you could do:
>> a = max(A(:,1));                  % number of runs
>> AA = {};
>> for i = 1:a
       AA{i} = A(A(:,1)==i, :);      % rows belonging to run i
       name = sprintf('%d.txt', i);
       dlmwrite(name, AA{i}, '\t');  % write run i as tab-delimited text
   end
The output will be .txt files containing tab-delimited data.
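For comparison, the same split-by-run idea can be written outside MATLAB as a short pandas sketch (the file names here are mine):

import pandas as pd

# whitespace-separated file with a "Run Lat Long Time" header row
df = pd.read_csv("runs.txt", sep=r"\s+")

# write one tab-delimited file per run
for run, group in df.groupby("Run"):
    group.to_csv(f"run_{run}.txt", sep="\t", index=False)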