What is the difference between using as="element(data)+" and as="element(data)*" in xsl:variable? The XSL solution below works if I use "+" but not when I use "*". Can someone clarify?
element(data)+
means a sequence of one or more data elements. That is, the sequence cannot be empty.
element(data)*
means a sequence of zero or more data elements. That is, the sequence can be empty.
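As a rough illustration of the two indicators (plus the other two defined for XPath 2.0 sequence types, no indicator and "?"), here is a minimal Python sketch of the cardinality check. The names OCCURRENCE_BOUNDS and cardinality_ok are made up for this example and are not part of any XSLT processor.

```python
# Occurrence indicators of XPath 2.0 sequence types, mapped to
# (minimum, maximum) item counts. None means "no upper bound".
# These names are hypothetical and exist only for this illustration.
OCCURRENCE_BOUNDS = {
    "": (1, 1),      # element(data)   -> exactly one item
    "?": (0, 1),     # element(data)?  -> zero or one item
    "*": (0, None),  # element(data)*  -> zero or more items
    "+": (1, None),  # element(data)+  -> one or more items
}


def cardinality_ok(item_count, indicator):
    """Return True if a sequence with item_count items satisfies the indicator."""
    low, high = OCCURRENCE_BOUNDS[indicator]
    return item_count >= low and (high is None or item_count <= high)


print(cardinality_ok(0, "+"))  # False: "+" rejects the empty sequence
print(cardinality_ok(0, "*"))  # True: "*" accepts the empty sequence
print(cardinality_ok(3, "+"))  # True
```

So a variable whose select can produce no data elements at all is only valid with "*" (or "?"), never with "+".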
I know that when we implement a ParDo transform, we pick up individual elements from our data (basically separated by "\n"). But what if I have an element that occupies two lines in my file? Can I apply my own condition to pick elements according to it? Or is it always necessary to have an element on a single line?
Reading of text files is controlled by TextIO, not by ParDo - I suppose that's what you meant. Indeed right now TextIO splits files into 1 element per line, however there is work in progress on changing that. You can follow the work at https://issues.apache.org/jira/browse/BEAM-2802.
It would be useful for that work if you told us more about your file format, to make sure it is in scope.
Being a novice in SPSS, I am struggling with finding duplicate cases based on a string variable in a dataset containing approx. 33,000 cases.
I have a variable named "nr" that is supposed to be a unique id for every case. However, it turns out that some cases might have two different values entered in "nr", the only difference being the last character. This results in one case being shown as two separate rows.
The structure of the variable "nr" is as follows: XX-XXXXXXX-X or X-XXXXXXX-X, i.e. 2-7-1 characters or 1-7-1 characters.
I would like to sort out all cases that have a "nr" equal to another case except for the last character.
To illustrate, with a successful syntax I would hopefully be able to sort cases like these out from the whole dataset:
20-4026988-2
20-4026988-3
5-4026992-5
5-4026992-8
20-4027281-2
20-4027281-3
Anyone have an idea on how to make a syntax for this? Would be so grateful for any input!
I suggest creating a new variable without that last character, and then looking for the doubles:
* first creating some sample data to play with.
data list list/ID (a15).
begin data.
20-4026988-2
12-2345678-7
20-4026988-3
5-4026992-5
5-4026992-8
12-1234567-1
20-4027281-2
6-1234567-1
20-4027281-3
end data.
* now creating the new variable and counting the occurrences of each shortened ID.
string ShortID (a15).
compute ShortID=char.substr(ID,1,char.rindex(ID,"-")).
* also possible: compute ShortID=char.substr(ID,1,char.length(rtrim(ID))-1).
aggregate out=* mode=add /break=ShortID/occurrences=n.
* at this point you can filter based on the number of `occurrences` or sort them.
sort cases by occurrences (d) ShortID.
After removing the last character, you can use Data > Identify Duplicate Cases to find the dups. It has a number of useful options for this.
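For readers who want to check the logic outside SPSS, the same idea (shorten each ID to everything up to and including the last "-", then count how often each shortened ID occurs) can be sketched in Python. This is only an illustration using the sample IDs from the syntax above; the helper name short_id is made up for the example.

```python
from collections import Counter

# Sample IDs copied from the SPSS example above.
ids = [
    "20-4026988-2", "12-2345678-7", "20-4026988-3",
    "5-4026992-5", "5-4026992-8", "12-1234567-1",
    "20-4027281-2", "6-1234567-1", "20-4027281-3",
]


def short_id(full_id):
    """Keep everything up to and including the last "-" (hypothetical helper)."""
    return full_id[: full_id.rindex("-") + 1]


# Count how many cases share each shortened ID.
occurrences = Counter(short_id(i) for i in ids)

# Keep only the cases whose shortened ID occurs more than once.
duplicates = [i for i in ids if occurrences[short_id(i)] > 1]
print(duplicates)
```

Note that "12-1234567-1" and "6-1234567-1" are not flagged, because their prefixes differ.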
Let's say I have a massive string consisting of just a single character, say x. I need to use Huffman encoding.
A Huffman tree is a full binary tree. So how does one create a Huffman code for just a single character, when we don't need two leaves at all?
jbr's answer is fine; this is just a longer version of it.
Huffman is meant to produce a minimal-length sequence of bits that contains all the information in the original sequence of symbols, assuming that the decoder already knows the set of symbols. If there's only one symbol, the input data contains no information except its length.
In Huffman-based data formats, length is usually encoded separately, not as part of the Huffman-encoded bit sequence itself. The decoder of a single-symbol Huffman code therefore has all the information it needs to reconstruct the input without needing to read anything from the Huffman-encoded bit sequence. It is logical, then, that the Huffman encoder's output should be 0 bits long.
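A minimal sketch of that idea in Python, assuming a made-up container in which the symbol and the length are stored in a header next to the (empty) bit payload. The function names and the tuple layout are illustrative only, not any real file format.

```python
def encode_single_symbol(text):
    """Encode a string of one repeated symbol as (symbol, length, payload)."""
    if len(set(text)) != 1:
        raise ValueError("expected exactly one distinct symbol")
    # The symbol and the length travel outside the bit stream (an assumption
    # of this sketch), so the Huffman payload itself carries zero bits.
    return text[0], len(text), ""


def decode_single_symbol(symbol, length, payload):
    """Rebuild the input; nothing needs to be read from the payload."""
    return symbol * length


symbol, length, payload = encode_single_symbol("xxxxxxxx")
print(len(payload))                                   # 0
print(decode_single_symbol(symbol, length, payload))  # xxxxxxxx
```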
If you don't have a length encoded separately, then you must have a symbol to represent End Of Sequence so the decoder knows when to stop reading. Then your Huffman tree will have 2 nodes and you won't run into this special case.
If you only have one symbol, then you only need 1 bit per symbol. So you really don't have to do anything except count the number of bits and translate each into your symbol.
You could simply add an edge case in your code.
For example:
Check whether there is only one character in your hash table, in which case the tree consists of only the root, without any leaves. In this case, you could assign a code to this root node in your encoding function, such as 0.
In the encoding function, you should refer to this edge case too.
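One way to write that edge case, as a sketch rather than a reference implementation: the code below builds a Huffman code table with Python's heapq and gives a lone symbol the one-bit code "0". The function names huffman_codes and encode are invented for this example.

```python
import heapq
from collections import Counter


def huffman_codes(text):
    """Return a {symbol: bit string} table for the symbols in text."""
    freq = Counter(text)
    if not freq:
        return {}
    if len(freq) == 1:
        # Edge case: the tree is just the root, so assign it the code "0".
        return {next(iter(freq)): "0"}
    # Heap entries are (weight, tiebreaker, {symbol: code so far}).
    # The unique tiebreaker keeps the dicts from ever being compared.
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreaker = len(heap)
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        # Merging two subtrees prefixes their codes with 0 and 1.
        merged = {s: "0" + c for s, c in left.items()}
        merged.update({s: "1" + c for s, c in right.items()})
        heapq.heappush(heap, (w1 + w2, tiebreaker, merged))
        tiebreaker += 1
    return heap[0][2]


def encode(text):
    """Encode text as a string of "0"/"1" characters."""
    codes = huffman_codes(text)
    return "".join(codes[ch] for ch in text)


print(huffman_codes("xxxx"))  # {'x': '0'}
print(encode("xxxx"))         # 0000
```

With this choice the output is one bit per input symbol, so a decoder can recover the length by counting bits.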
I was wondering about the behavior of this code:
str = "abcd"
print( str:find"a(bc)d" ) -- prints 1 4 bc
print( str:find"(ab)cd" ) -- prints 1 4 ab
Even though both of the two lines are looking for, and return, different strings, they return the same indices because they have the same frame of reference. In other words, the captures are ignored when calculating the indices, but then they are returned normally.
My original question was going to be about what went wrong, but then I saw that the manual actually indicates that this is proper behavior (though it isn't very clear).
The problem was that I was trying to find something based on a marker near it, without returning the position of that marker. I expected string.find to return the position of the first capture, if there was one, so I just wrapped the part I wanted the position of in parentheses. Obviously, that didn't help. I found a different (and better) solution to the problem, but I don't think that is always possible or convenient.
Is there any reason for string.find to behave this way? Is there any particular benefit for users? If you have absolute mastery of Lua: is there actually no case where this causes a serious problem?
Captures are a byproduct of matching. Even when you give a pattern that has captures, you are still interested in matching the whole pattern. In other words, matching answers the question: where in the given string does this subtext appear? Captures are just extra bits of information about the match.
string.find returns the location of the match to allow you (for instance) to continue parsing the string after the match, possibly with a different pattern.
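The "continue parsing after the match" use case can be sketched with Python's re module, purely as an analogy and not as Lua code: the position of the whole match tells you where to resume, while the capture is extra information. The input string and both patterns are made up for this example, and Python positions are 0-based where Lua's are 1-based.

```python
import re

text = "width=42;height=7"

# First pattern: a key followed by "=". The capture holds the key only.
key_match = re.compile(r"(\w+)=").search(text)
print(key_match.span())    # (0, 6): extent of the whole match, including "="
print(key_match.group(1))  # width

# Resume right after the whole match with a different pattern.
value_match = re.compile(r"(\d+)").match(text, key_match.end())
print(value_match.group(1))  # 42
```

If the reported position were that of the capture instead, the second step would have to skip the "=" by hand.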
I have 2D array in which the second column has domain names of some emails, let us call the array myData[][]. I decided to use ArrayLib in order to search the second column for a specific domain.
ArrayLib.indexOf(myData, 1, domain)
Here is where I found an issue. In the myData array, one of the domains looks like this: "ewmining.com" (pay attention to the w).
While searching for "e.mining.com" (notice the first dot), the indexOf() function actually gave me the row containing "ewmining.com".
This is what is in the array "ewmining.com"
This is what is in the search string "e.mining.com"
It seems that ArrayLib treats the dot as meaning any character. Is this supposed to be the correct behavior? Is there a way to stop this behavior and search for an exact match?
I really need help on this issue.
Thanks in advance for your help.
The dot usually represents "any character" in regular expressions. I am not familiar with ArrayLib, but maybe you should look for a way to turn off regular expressions when searching. Otherwise you might have to escape the dot, for example search for e[.]mining[.]com
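The effect is easy to reproduce with any regular-expression engine. The Python sketch below does not use ArrayLib; find_regex is a made-up stand-in that treats the search term as a regular expression, which is only a guess at what ArrayLib does internally.

```python
import re

domains = ["example.org", "ewmining.com", "e.mining.com"]


def find_regex(needle, haystack):
    """Index of the first item that fully matches needle as a regex, else -1."""
    for i, item in enumerate(haystack):
        if re.fullmatch(needle, item):
            return i
    return -1


# Unescaped: "." matches any character, so "ewmining.com" is found first.
print(find_regex("e.mining.com", domains))             # 1

# Escaped, either by hand or with re.escape: only the literal dot matches.
print(find_regex("e[.]mining[.]com", domains))         # 2
print(find_regex(re.escape("e.mining.com"), domains))  # 2

# A plain equality search needs no escaping at all.
print(domains.index("e.mining.com"))                   # 2
```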