Power Query - regional settings - delimiter

I've made a best-practice sheet for my department, which works perfectly on my own computer (en-US regional settings), but when I sent it to a coworker (da-DK regional settings), everything blew up.
Half the department uses en-US and the other half uses da-DK. It's not an option for everybody to use the same settings, so I'd like to create a sheet that can handle both automatically.
CSV files created with en-US settings use "," as the delimiter and "." for decimals, while da-DK settings use ";" as the delimiter and "," for decimals.
How would I best go about this issue?
I have the following two files containing identical data:
Data1.csv:
Panel/Node/Case, MXX (kNm/m), MYY (kNm/m)
1/ 1/ 1, 145.46, 145.46
1/ 1/ 2, 98.83, 98.83
1/ 1/ 3 (C), 244.30, 244.30
1/ 2/ 1, 19.80, 19.80
1/ 2/ 2, 13.46, 13.46
1/ 2/ 3 (C), 33.26, 33.26
1/ 3/ 1, 19.80, 19.80
1/ 3/ 2, 13.46, 13.46
1/ 3/ 3 (C), 33.26, 33.26
1/ 4/ 1, 145.46, 145.46
1/ 4/ 2, 98.83, 98.83
1/ 4/ 3 (C), 244.30, 244.30
Data2.csv:
Panel/Node/Case; MXX (kNm/m); MYY (kNm/m)
1/ 1/ 1; 145,46; 145,46
1/ 1/ 2; 98,83; 98,83
1/ 1/ 3 (C); 244,30; 244,30
1/ 2/ 1; 19,80; 19,80
1/ 2/ 2; 13,46; 13,46
1/ 2/ 3 (C); 33,26; 33,26
1/ 3/ 1; 19,80; 19,80
1/ 3/ 2; 13,46; 13,46
1/ 3/ 3 (C); 33,26; 33,26
1/ 4/ 1; 145,46; 145,46
1/ 4/ 2; 98,83; 98,83
1/ 4/ 3 (C); 244,30; 244,30
I was considering using a Replace Values step to replace the characters I need, but since both formats contain a ",", that proves difficult.

Maybe there is some dynamic/built-in means of detecting a file's content's locale and delimiter. I haven't found one yet. Until someone comes along and points out a better way of doing this, maybe the below can serve as an interim solution.
The parseCsv function in the code below attempts to infer the delimiter (it assumes that the delimiter will always be present in the first line of the CSV, immediately after Panel/Node/Case).
It also tries to transform the last two columns' values from text to numbers (by replacing "," with "."). If you don't want this behaviour, you can delete it from the code. Also, this transformation will only work on machines whose locale uses "." for the decimal separator. (If you need to support other cultures/locales, it might make more sense to try to infer the culture/locale and pass it as a second argument to Number.FromText.)
let
    data1 = Text.ToBinary(
"Panel/Node/Case, MXX (kNm/m), MYY (kNm/m)
1/ 1/ 1, 145.46, 145.46
1/ 1/ 2, 98.83, 98.83
1/ 1/ 3 (C), 244.30, 244.30
1/ 2/ 1, 19.80, 19.80
1/ 2/ 2, 13.46, 13.46
1/ 2/ 3 (C), 33.26, 33.26
1/ 3/ 1, 19.80, 19.80
1/ 3/ 2, 13.46, 13.46
1/ 3/ 3 (C), 33.26, 33.26
1/ 4/ 1, 145.46, 145.46
1/ 4/ 2, 98.83, 98.83
1/ 4/ 3 (C), 244.30, 244.30", TextEncoding.Utf8),
    data2 = Text.ToBinary(
"Panel/Node/Case; MXX (kNm/m); MYY (kNm/m)
1/ 1/ 1; 145,46; 145,46
1/ 1/ 2; 98,83; 98,83
1/ 1/ 3 (C); 244,30; 244,30
1/ 2/ 1; 19,80; 19,80
1/ 2/ 2; 13,46; 13,46
1/ 2/ 3 (C); 33,26; 33,26
1/ 3/ 1; 19,80; 19,80
1/ 3/ 2; 13,46; 13,46
1/ 3/ 3 (C); 33,26; 33,26
1/ 4/ 1; 145,46; 145,46
1/ 4/ 2; 98,83; 98,83
1/ 4/ 3 (C); 244,30; 244,30", TextEncoding.Utf8),
    parseCsv = (someFile as binary) =>
        let
            // read the raw lines so we can inspect the header
            lines = Lines.FromBinary(someFile, QuoteStyle.Csv, false, TextEncoding.Utf8),
            firstLine = List.First(lines),
            // the delimiter is assumed to sit immediately after "Panel/Node/Case"
            expectedDelimiterPosition = Text.Length("Panel/Node/Case"),
            delimiterInferred = Text.At(firstLine, expectedDelimiterPosition),
            csv = Csv.Document(someFile, [Delimiter = delimiterInferred, Encoding = TextEncoding.Utf8, QuoteStyle = QuoteStyle.Csv]),
            promoted = Table.PromoteHeaders(csv, [PromoteAllScalars = true]),
            lastTwoColumnsAsNumbers =
                let
                    lastTwoHeaders = List.LastN(Table.ColumnNames(promoted), 2),
                    // normalise the decimal separator to "." before converting to number
                    replaceAndConvertToNumber = (someText as text) as number => Number.From(Text.Replace(someText, ",", ".")),
                    transformers = List.Transform(lastTwoHeaders, each {_, replaceAndConvertToNumber, type number}),
                    transformed = Table.TransformColumns(promoted, transformers)
                in transformed
        in lastTwoColumnsAsNumbers,
    parsed1 = parseCsv(data1),
    parsed2 = parseCsv(data2),
    parsed3 = parseCsv(File.Contents("C:\Users\MRCH\Desktop\Data1.csv"))
in
    parsed3
To implement this, you can copy the code above, create a blank query (in my version of Excel, I do this via: Data > Get Data > From Other Sources > From Blank Query), click Advanced Editor (near the top left), delete any existing code, paste what you've copied, then click "Done".
To make the parseCsv function work with a file path, you could for example change parsed1 = parseCsv(data1) to parsed1 = parseCsv(File.Contents("SOME_FILE_PATH")) where SOME_FILE_PATH is the file path to Data1.csv on your machine (keep the double quotes).
In the Query Editor, you can click on the steps parsed1 and parsed2 to view them (they are basically what the parseCsv function returns for Data1.csv and Data2.csv, respectively). data1 and data2 are just there for demonstration purposes; you'd replace them with the actual binary content of your CSVs.
If that doesn't help, let me know where I can improve my explanation.

Related

Missing data in time series

I'm new to this field and I'm trying to explore the data for a time series: find the missing values, count them, study the distribution of their lengths, and fill in those gaps. I have, let's say, 10 .txt files, and each file has 2 columns, as follows:
C1 C2
944 0
920 1
920 2
928 3
912 7
920 8
920 9
880 10
888 11
920 12
944 13
and so on, let's say until 100; the 10 files don't necessarily have the same number of observations.
The missing values don't necessarily appear in all of the files. In this example they are 4, 5, and 6 in C2, plus the corresponding values in the first column C1 (which is measured in milliseconds, so the value 928 ms is not a time neighbor of 912 ms). I want to find those gaps (the total number of missing values across all 10 files) and show a histogram of their lengths.
I wrote a piece of code in R, but the problem is that I don't get the exact total number of missing values that I should.
path = "files path"
out.file<-data.frame(TS = 0, Index = 0, File = '')
file.names <- dir(path, pattern =".txt")
for(i in 1:length(file.names)){
file <- cbind(read.table(file.names[i],
header=F,
sep ="\t",
stringsAsFactors=FALSE),
file.names[i])
colnames(file) <- c('TS', 'Index', 'File')
out.file <- rbind(out.file, file)
}
d = dim(out.file)[1]
misDa = 0
for(i in 2:(d-1)){
if(abs(out.file$Index[i]-out.file$Index[i+1]) > 1)
misDa = misDa+1
}
Hard to give specific hints without having a more extensive example of your data that contains some of the actual NAs.
If you are using R (like it seems) the naniar and the imputeTS packages offer nice functions for missing data visualizations.
The naniar package is especially good for multivariate data, and the imputeTS package is especially good for time series data; see each package's documentation for galleries of example plots.
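As for why your count is off: the loop adds 1 per gap regardless of the gap's length, and it also runs across file boundaries (including the dummy seed row), so the total can't match the number of missing values. The core logic, independent of language: per file, take successive differences of the index column; a difference d > 1 marks a gap of d - 1 missing values. A minimal pandas sketch of that idea (Python here purely to illustrate the logic; the series below is the C2 column from your example):

import pandas as pd

# C2 (index) column of one file, as shown in the question
idx = pd.Series([0, 1, 2, 3, 7, 8, 9, 10, 11, 12, 13])

diffs = idx.diff().dropna()         # successive differences between indices
gap_lengths = diffs[diffs > 1] - 1  # a jump of d means d - 1 missing values
print(int(gap_lengths.sum()))       # total missing: 3 (indices 4, 5, 6)
print(gap_lengths.value_counts())   # gap-length distribution, ready for a histogram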

Clustering to achieve heterogeneous groups

I want to group 100 users based on a categorical variable (which can be low, medium, or high). The group size should be 3. I want to get the maximal heterogeneity within groups, assuming that users are distributed equally. I wonder if I can use some clustering algorithm to group based on the dissimilarity? Any suggestions?
I don't believe you need a clustering algorithm to group the data based upon a categorical variable.
Based on your question, I think this should work.
# Code
from sklearn.model_selection import train_test_split
group1, group23 = train_test_split(data, test_size=2/3., stratify=data['lab'])
group2, group3 = train_test_split(group23, test_size=1/2., stratify=group23['lab'])
The stratify argument makes sure each split preserves the proportions of the categorical variable, so with an equal distribution every group ends up with one user from each level, which is the maximal heterogeneity within groups.
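For reference, here is a self-contained version of the above that you can run as-is; the DataFrame simply mirrors the sample output below, and pandas plus a fixed random_state are my additions:

import pandas as pd
from sklearn.model_selection import train_test_split

# Mirror the sample data: three users per level of 'lab'
data = pd.DataFrame({
    'val1': range(1, 10),
    'val2': range(1, 10),
    'lab':  ['L'] * 3 + ['M'] * 3 + ['H'] * 3,
})

# First split peels off 1/3 of the rows as group1; stratifying on 'lab'
# guarantees group1 gets exactly one L, one M, and one H.
group1, group23 = train_test_split(data, test_size=2/3, stratify=data['lab'], random_state=0)
# Second split halves the remainder the same way.
group2, group3 = train_test_split(group23, test_size=1/2, stratify=group23['lab'], random_state=0)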
# Sample output
print(data)
val1 val2 lab
0 1 1 L
1 2 2 L
2 3 3 L
3 4 4 M
4 5 5 M
5 6 6 M
6 7 7 H
7 8 8 H
8 9 9 H
print(group1)
val1 val2 lab
4 5 5 M
1 2 2 L
6 7 7 H
print(group2)
val1 val2 lab
8 9 9 H
2 3 3 L
3 4 4 M
print(group3)
val1 val2 lab
0 1 1 L
7 8 8 H
5 6 6 M
train_test_split() Documentation

How does 3 modulo 7 = 3?

I was wondering how modulo works. I know how it works when the bigger number comes first, but not the opposite. I know that 7 % 3 = 1, since 3 goes into 7 two times with 1 remaining. However, for 3 % 7 the calculator shows 3. Is this because 7 goes into 3 zero times and the remainder is 3? Is that how it works?
Your reasoning is correct. Any time the divisor is larger than the dividend (both nonnegative), the result of the modulo operation equals the dividend.
Write 3 = 7*x + y, where x and y are integers, x >= 0, and 0 <= y < 7. The only solution is x = 0, so yes, y = 3.
3 MOD 7 = 0 remainder 3.
This is so because 3/7 is greater than 0 but less than 1, so the integer quotient is 0 and the whole dividend is left over as the remainder.
Mod just means you take the remainder after performing the division. When you divide 3 by 7 you get 3 = 0*7 + 3, which means the remainder is 3.
7 % 3 -> 3 goes into 7 two times, and 7 - (3*2) = 1, so the remainder is 1.
3 % 7 -> 7 goes into 3 zero times, and 3 - (7*0) = 3, so here the remainder is 3.
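A quick way to convince yourself, in Python (any language with integer division behaves the same way for nonnegative operands):

# quotient * divisor + remainder always reconstructs the dividend
print(7 % 3)   # 1, because 7 = 2*3 + 1
print(3 % 7)   # 3, because 3 = 0*7 + 3
assert 7 == (7 // 3) * 3 + 7 % 3
assert 3 == (3 // 7) * 7 + 3 % 7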

Apply function to each row in Torch

I know that tensors have an apply method, but this only applies a function to each element. Is there an elegant way to do row-wise operations? For example, can I multiply each row by a different value?
Say
A =
1 2 3
4 5 6
7 8 9
and
B =
1
2
3
and I want to multiply each element in the ith row of A by the ith element of B to get
1 2 3
8 10 12
21 24 27
how would I do that?
See this link: Torch - Apply function over dimension
(Thanks to Alexander Lutsenko for providing it. I just moved it to the answer.)
One possibility is to expand B as follows:
1 1 1
2 2 2
3 3 3
[torch.DoubleTensor of size 3x3]
Then you can use element-wise multiplication directly:
local A = torch.Tensor{{1,2,3},{4,5,6},{7,8,9}}
local B = torch.Tensor{1,2,3}
-- view B as a 3x1 column and expand it (no copy) to 3x3;
-- cmul then multiplies element-wise in place, so row i of A is scaled by B[i]
local C = A:cmul(B:view(3,1):expand(3,3))
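If you are on PyTorch (the Python successor to Lua Torch) rather than Lua Torch, broadcasting does the expansion for you; a minimal sketch:

import torch

A = torch.tensor([[1., 2., 3.], [4., 5., 6.], [7., 8., 9.]])
B = torch.tensor([1., 2., 3.])

# Reshaping B to a 3x1 column lets broadcasting stretch it across A's columns,
# so row i of A is multiplied by B[i].
C = A * B.view(3, 1)
print(C)  # rows scaled by 1, 2, and 3 respectively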

I want to compute a variable in SPSS with "if"

I would like to create the last column. Thank you in advance!
You could try something like this:
/*************************************/.
DATA LIST FREE /v1 v2 v3 v4 v5.
BEGIN DATA
1 2 99 4 5
99 2 3 99 5
1 99 3 4 5
1 2 99 99 5
1 99 99 99 5
99 2 99 99 99
END DATA.
DATASET NAME DS1.
/*************************************/.
/* Solution1: Assumes v1 to v5 can hold any value from 1 to 5 */.
recode v1 to v5 (99,sysmis=sysmis) (else=copy).
do repeat v=v1 to v5.
if (any(v,1,4,5)) Target1=1.
if (any(v,2,3)) Target2=2.
end repeat.
compute TargetA=sum(Target1,Target2).
/* Solution2: Alternative solution which assumes v1 holds value 1 only, v2 value 2 only, etc. */.
recode v1 to v5 (99,sysmis=sysmis) (else=1).
compute TargetB=sum(any(1,v1,v4,v5)*1, any(1,v2,v3)*2).
exe.
If I understand you correctly:
- Your input file contains 5 columns, 1 per channel.
- Each channel-specific column is filled with a channel-specific identifier (1-5).
- When the column is empty, that channel is not used / not relevant for that observation.
- You want to summarize the mix of channels used in a new field (NewVar).
- You want to use the IF statement in SPSS syntax.
The answer above by JigneshSutar does not seem to do this. Also, you do not need the DO REPEAT loops; you can do this in 3 lines (+ EXECUTE.) of syntax (using the data generator from JigneshSutar's answer):
IF (V1 = 1 & V4 = 4 & V5 = 5) NewVar = 1.
IF (V2 = 2 & V3 = 3) NewVar = 2.
IF (V1 = 1 & V2 = 2 & V3 = 3 & V4 = 4 & V5 = 5) NewVar = 3.
EXECUTE.
This syntax can easily be adjusted when the channel columns are filled with other values than the channel identifiers [1-5], for instance by using the missing function:
IF (MISSING(V1)=0 & MISSING(V4)=0 & MISSING(V5)=0) NewVar = 1.
IF (MISSING(V2)=0 & MISSING(V3)=0) NewVar = 2.
IF (MISSING(V1)=0 & MISSING(V2)=0 & MISSING(V3)=0 & MISSING(V4)=0 & MISSING(V5)=0) NewVar = 3.
EXECUTE.
