After reading:
Combine two files with unequal length on common column with multiple matches with linux command line
I wonder how you would do a full outer join.
(hopefully it's ok to start a new question with it)
file one
A 1
C 4
file two
A 2
B 5
file three
A 7
D 9
the result would be:
A 1 2 7
B N 5 N
C 4 N N
D N N 9
Is there an awk-one-line-solution like I saw for left outer join?
With GNU awk for true multi-dimensional arrays, ARGIND, and sorted_in:
$ cat tst.awk
{ vals[$1][ARGIND] = $2 }
END {
    PROCINFO["sorted_in"] = "@ind_str_asc"
    for (key in vals) {
        printf "%s%s", key, OFS
        for (fileNr=1; fileNr<=ARGIND; fileNr++) {
            val = (fileNr in vals[key] ? vals[key][fileNr] : "N")
            printf "%s%s", val, (fileNr<ARGIND ? OFS : ORS)
        }
    }
}
$ awk -f tst.awk file1 file2 file3
A 1 2 7
B N 5 N
C 4 N N
D N N 9
It's possible to solve this using the POSIX-standard join command. Given that the three files proposed in the question are named 1.txt, 2.txt and 3.txt:
join -a1 -a2 -eN -o 0,1.2,2.2 1.txt 2.txt | join -a1 -a2 -eN -o 0,1.2,1.3,2.2 - 3.txt
Note that while this provides the required output for the question as stated, it's quite a bit less flexible than Ed Morton's awk-based solution. For one thing, join needs its input files to be sorted on the shared field(s). And for another, it will only work for exactly three input files.
However, it's possibly simpler to use join in one-off ad hoc cases!
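For a one-off run, the whole pipeline can be scripted like this. This is a sketch assuming bash (it uses process substitution to sort the inputs on the fly, in case they aren't already sorted); the file names are the ones from the question:

```shell
#!/usr/bin/env bash
# Recreate the three sample files from the question.
printf 'A 1\nC 4\n' > 1.txt
printf 'A 2\nB 5\n' > 2.txt
printf 'A 7\nD 9\n' > 3.txt

# join requires its inputs to be sorted on the join field, so sort
# them on the fly; -eN fills missing fields with N, and -o picks
# the output columns explicitly.
join -a1 -a2 -eN -o 0,1.2,2.2 <(sort 1.txt) <(sort 2.txt) \
  | join -a1 -a2 -eN -o 0,1.2,1.3,2.2 - <(sort 3.txt)
```

The intermediate join output is already sorted on the key, so the second join can read it straight from the pipe. This prints the four rows from the question: A 1 2 7, B N 5 N, C 4 N N, D N N 9.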
While developing BigZ, mostly used for number-theoretic experiments, I've discovered the need for orthogonality in the word set that creates, filters, or transforms sets. I want a few words that, logically combined, cover a wide range of commands, without the need to memorize a large number of words and ways to combine them.
1 100 condition isprime create-set
puts the set of all prime numbers between 1 and 100 on a set stack, while
function 1+ transform-set
transforms this set into the set of all numbers p+1, where p is a prime less than 100.
Further,
condition sqr filter-set
leaves the set of all perfect squares of the form p+1 on the stack.
This works rather nicely for sets of natural numbers, but to be able to create, filter, and transform sets of n-tuples I need to be able to count the locals in unnamed words. I have redesigned the words to briefly denote compound conditions and functions:
: ~ :noname ;
: :| postpone locals| ; immediate
1 100 ~ :| p | p is prime p 2 + isprime p 2 - isprime or and ;
1 100 ~ :| a b | a dup * b dup * + isprime ;
Executing these two examples gives the parameter stack ( 1 100 xt ), but to handle this right (in the first case a set of numbers and in the second case a set of pairs should be produced) I'll have to complement the word :| to get ( 1 100 xt n ), where n is the number of locals used. I think one could use >IN and PARSE to do this, but it was a long time ago that I did such things, so I doubt I can do it properly nowadays.
I didn't understand (LOCAL), but with patience and luck I managed to do it with my original idea:
: bl#  \ ad n -- m
  over + swap 0 -rot
  do i c@ bl = +
  loop negate ;
\ count the number of blanks in the string ad n

variable loc#

: locals#  \ --
  >in @ >r
  [char] | parse bl# loc# !
  r> >in ! ; immediate
\ count the number of locals while loading

: -|  \ --
  postpone locals#
  postpone locals| ; immediate
\ replace LOCALS|
Now
: test -| a b | a b + ;
works like LOCALS| but leaves the number of locals in the global variable loc#.
Maybe you should drop LOCALS| and parse the local variables yourself. For each one, call (LOCAL) with its name, and end with passing an empty string.
See http://lars.nocrew.org/dpans/dpans13.htm#13.6.1.0086 for details.
I posed the question generically, because maybe there is a generic answer. But a specific example is comparing two BigQuery tables with the same schema, but potentially different data. I want a diff, i.e. what was added, deleted, or modified, with respect to a composite key, e.g. the first two columns.
Table A
C1 C2 C3
-----------
a a 1
a b 1
a c 1
Table B
C1 C2 C3 # Notes if comparing B to A
-------------------------------------
a a 1 # No Change to the key a + a
a b 2 # Key a + b Changed from 1 to 2
# Deleted key a + c with value 1
a d 1 # Added key a + d
I basically want to be able to make/report the comparison notes.
Or, from a Beam perspective, I may want to just output up to 4 labeled PCollections: Unchanged, Changed, Added, Deleted. How do I do this, and what would the PCollections look like?
What you want to do here, basically, is join the two tables and compare the result, right? You can look at my answer to this question to see the two ways in which you can join two tables (side inputs, or CoGroupByKey).
I'll also code a solution for your problem using CoGroupByKey. I'm writing the code in Python because I'm more familiar with the Python SDK, but you'd implement similar logic in Java:
import apache_beam as beam
from apache_beam import pvalue
from apache_beam.io.textio import WriteToText

def make_kv_pair(x):
    """Output the record keyed by (x[0], x[1])."""
    return ((x[0], x[1]), x)

table_a = (p | 'ReadTableA' >> beam.Read(beam.io.BigQuerySource(....))
             | 'SetKeysA' >> beam.Map(make_kv_pair))
table_b = (p | 'ReadTableB' >> beam.Read(beam.io.BigQuerySource(....))
             | 'SetKeysB' >> beam.Map(make_kv_pair))

joined_tables = ({'table_a': table_a, 'table_b': table_b}
                 | beam.CoGroupByKey())

output_types = ['changed', 'added', 'deleted', 'unchanged']

class FilterDoFn(beam.DoFn):
    def process(self, element):
        key, values = element
        table_a_value = list(values['table_a'])
        table_b_value = list(values['table_b'])
        if table_a_value == table_b_value:
            yield pvalue.TaggedOutput('unchanged', key)
        elif len(table_a_value) < len(table_b_value):
            yield pvalue.TaggedOutput('added', key)
        elif len(table_a_value) > len(table_b_value):
            yield pvalue.TaggedOutput('deleted', key)
        else:
            yield pvalue.TaggedOutput('changed', key)

key_collections = (joined_tables
                   | beam.ParDo(FilterDoFn()).with_outputs(*output_types))

# Now you can handle each output
key_collections.unchanged | WriteToText(...)
key_collections.changed | WriteToText(...)
key_collections.added | WriteToText(...)
key_collections.deleted | WriteToText(...)
I have a list of variables for which I want to create a list of numbered variables. The intent is to use these with the reshape command to create a stacked data set. How do I keep them in order? For instance, with this code
local ct = 1
foreach x in q61 q77 q99 q121 q143 q165 q187 q209 q231 q253 q275 q297 q306 q315 q324 q333 q342 q351 q360 q369 q378 q387 q396 q405 q414 q423 {
    gen runs`ct' = `x'
    local ct = `ct' + 1
}
when I use the reshape command it generates an order as
runs1 runs10 runs11 ... runs2 runs22 ...
rather than the desired
runs01 runs02 runs03 ... runs26
Preserving the order is necessary in this analysis. I'm trying to add a leading zero to all ct values less than 10 when assigning variable names.
Generating a series of identifiers with leading zeros is a documented and solved problem: see e.g. here.
local j = 1
foreach v in q61 q77 q99 q121 q143 q165 q187 q209 q231 q253 q275 q297 q306 q315 q324 q333 q342 q351 q360 q369 q378 q387 q396 q405 q414 q423 {
    local J : di %02.0f `j'
    rename `v' runs`J'
    local ++j
}
Note that I used rename rather than generate. If you are going to reshape the variables afterwards, the labour of copying the contents is unnecessary. Indeed the default float type for numeric variables used by generate could in some circumstances result in loss of precision.
I note that there may also be a solution with rename groups.
All that said, it's hard to follow your complaint about what reshape does (or does not) do. If you have a series of variables like runs*, the most obvious reshape is a reshape long, and for example
clear
set obs 1
gen id = _n
foreach v in q61 q77 q99 q121 q143 {
    gen `v' = 42
}
reshape long q, i(id) j(which)
list
+-----------------+
| id which q |
|-----------------|
1. | 1 61 42 |
2. | 1 77 42 |
3. | 1 99 42 |
4. | 1 121 42 |
5. | 1 143 42 |
+-----------------+
works fine for me; the column order information is preserved and no use of rename was needed at all. If I want to map the suffixes to 1 up, I can just use egen, group().
So, that's hard to discuss without a reproducible example. See https://stackoverflow.com/help/mcve for how to post good code examples.
I need to count the received packets from a .tr trace file. The problem is that only one kind of line is relevant to me, but some unnecessary events are also counted.
So I want a solution.
line1: r 0.500000000 1 RTR --- 0 cbr 210 [0 0 0 0] ------- [1:0 5:0 32 0] [0] 0 0
line2: r 0.501408175 3 RTR --- 0 AODV 48 [0 ffffffff 1 800] ------- [1:255 -1:255 30 0] [0x2 1 1 [5 0] [1 4]] (REQUEST)
I want only line 1, but as I am searching for '^r' only, both lines are returned. Please help me: how can I search for lines where 2 patterns are needed?
You can use grep to match more than one expression at a time, returning lines where at least one of the expressions matches.
OR-like (match if foo or bar is present; -E enables extended regex alternation):
grep -E "foo|bar" file
AND-like (match only if both foo and bar are present):
grep "foo" file | grep "bar"
As you need 2 patterns, I would go for the latter. Still, you could improve your question by adding an example of what you expect your command to return, and the exact command you are currently using.
You may also define a grep command that finds the occurrences of foo but excludes lines containing bar. Maybe that is easier, depending on what you need.
grep "foo" file | grep -v "bar"
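Applied to the trace lines above, the AND form might look like this (the file name trace.tr and cbr as the second pattern are assumptions based on the sample lines in the question):

```shell
#!/usr/bin/env bash
# Two sample lines modelled on the .tr trace in the question.
cat > trace.tr <<'EOF'
r 0.500000000 1 RTR --- 0 cbr 210 [0 0 0 0] ------- [1:0 5:0 32 0] [0] 0 0
r 0.501408175 3 RTR --- 0 AODV 48 [0 ffffffff 1 800] ------- [1:255 -1:255 30 0] [0x2 1 1 [5 0] [1 4]] (REQUEST)
EOF

# Received packets that are also cbr: the first grep keeps the
# 'r' (receive) events, the second keeps only the cbr ones.
grep '^r' trace.tr | grep 'cbr'

# -c counts the matching lines instead of printing them.
grep '^r' trace.tr | grep -c 'cbr'
```

Here only line 1 survives the pipeline, and the count reported is 1.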
I'm at a loss.
Scenario: get depth of 2 from Joe (#'s represent 'Person'. Letters represent 'Position')
0 E
| |
1 B
/ | \ / | \
2 JOE 3 C A D
/|\ /|\
0 0 0 F G H
/\ | | /\ | |
0 0 0 0 I J K L
Catch is, a person is tied to a position. Positions have relationships to each other, but a person doesn't have a relationship to another person. So it goes something like:
Joe<-[:occupied_by]-(PositionA)-[:authority_held_by]->
(PositionB)-[:occupied_by]->Sam
This query:
MATCH (:Identity {value:"1234"})-[:IDENTIFIES]->(posStart:Position)-[:IS_OCCUPIED_BY]->(perStart:Person)
OPTIONAL MATCH p=(perStart)<-[:IS_OCCUPIED_BY]-(posStart)-[r:AUTHORITY_HELD_BY*..1]-(posEnd:Position)-[:IS_OCCUPIED_BY]->(perEnd:Person)
RETURN p
does get me what I need, but it always returns the original node it started with (perStart) as the first column. I want to return it in a way where the first column always represents the start node and the second column the end node.
PositionA, PositionB (and we can infer this means A-[:authority_held_by]->B)
If we had bidirectional relationships, such as A-[:authority_held_by]->B and B-[:manages]->A, I wouldn't mind what's in the first or the second column, as we could have a third column represent the relationship:
PositionB, PositionA, [:manages]
but we are trying to stay away from bidirectional relationships.
Ultimately I want something like:
PositionA, PositionB (inferring, A-[:A_H_B]->B)
PositionB, PositionE (inferring, B-[:A_H_B]->E)
PositionF, PositionA (inferring, F-[:A_H_B]->A)
PositionG, PositionA (inferring, G-[:A_H_B]->A)
Is this possible with Cypher, or do I have to do some black magic? :)
I hope I explained it thoroughly and understandably. Thank you so much in advance!
Would replacing RETURN p with
RETURN nodes(p)[0] AS start, last(nodes(p)) AS last
work?