Merging files based on common fields - join

I'm trying to join three text files in similar formats based on common fields, while keeping the uncommon fields. Here's an example:
File1:
X
A 1
B 3
C 2
D 1
File2:
Y
A 3
C 2
E 3
File3:
Z
A 2
E 1
D 1
F 3
Merged:
X Y Z
A 1 3 2
B 3 - -
C 2 2 -
D 1 - 1
E - 3 1
F - - 3
It doesn't have to be a - where there's no corresponding value. The join command in this question https://unix.stackexchange.com/questions/43417/join-two-files-with-matching-columns works well except that it doesn't keep the uncommon fields.
Thank you.

join isn't well suited to a three-way full outer join like this, but here's a Python program that does what you want:
#!/usr/bin/env python3
import sys

files = [open(name) for name in sys.argv[1:]]    # list of input files
headers = [f.readline().strip() for f in files]  # first line of each file (X, Y, Z)
blanks = ['-'] * len(headers)                    # one placeholder per file
data = {}                                        # { rowname: [datum, ...] }
for ii, infile in enumerate(files):              # read the body of each file
    for line in infile:
        key, value = line.split()
        if key not in data:
            data[key] = blanks[:]                # copy, so rows don't share one list
        data[key][ii] = value
print('\t' + '\t'.join(headers))
for key, values in sorted(data.items()):
    print(key + '\t' + '\t'.join(values))
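If you save it as, say, merge.py (any name works), run it with the files in the order their columns should appear: python3 merge.py File1 File2 File3.
If you happen to have pandas available, an outer join over the files gives the same table. A minimal sketch, assuming the same whitespace-separated files as above (File1, File2, File3, each with a one-word header line):
import functools
import pandas as pd

def read_one(name):
    with open(name) as f:
        header = f.readline().strip()  # first line holds the column name (X, Y, Z)
        return pd.read_csv(f, sep=r'\s+', header=None,
                           names=['key', header], index_col='key', dtype=str)

frames = [read_one(name) for name in ['File1', 'File2', 'File3']]
merged = functools.reduce(lambda left, right: left.join(right, how='outer'), frames)
print(merged.fillna('-').sort_index())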

Related

Best way to parallelize computation over dask blocks that do not return np arrays?

I'd like to return a dask dataframe from an overlapping dask array computation, where each block's computation returns a pandas dataframe. The example below shows one way to do this, simplified for demonstration purposes. I've found that a combination of da.overlap.overlap and to_delayed().ravel() is able to get the job done, if I pass in the relevant block key and chunk information.
Edit:
Thanks to @AnnaM, who caught bugs in the original post and then made the code general! Building off of her comments, I'm including an updated version of the code below. Also, in response to Anna's question about memory usage, I verified that this does not seem to take up more memory than naively expected.
import dask
import dask.array as da
import dask.dataframe as dd
import numpy as np
import pandas as pd

def extract_features_generalized(chunk, offsets, depth, columns):
    shape = np.asarray(chunk.shape)
    offsets = np.asarray(offsets)
    depth = np.asarray(depth)
    coordinates = np.stack(np.nonzero(chunk)).T
    # keep only points that fall outside the overlap (halo) region
    keep = ((coordinates >= depth) & (coordinates < (shape - depth))).all(axis=1)
    # shift chunk-local coordinates to global coordinates
    data = coordinates + offsets - depth
    df = pd.DataFrame(data=data, columns=columns)
    return df[keep]

def my_overlap_generalized(data, chunksize, depth, columns, boundary):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth, boundary=boundary)
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)

data = np.zeros((2, 4, 8, 16, 16))  # round, channel, z, y, x
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr,
                            chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2),
                            columns=['r', 'c', 'z', 'y', 'x'],
                            boundary=tuple(['reflect'] * 5))
df.compute().reset_index()
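Note that this updated version doesn't pass a meta dataframe to dd.from_delayed. As the comments in the original version below explain, dask will then run a single delayed computation up front to infer the metadata; if that inference is expensive for you, pass meta as in the original version.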
-- Remainder of original post, including original bugs --
My example only does xy overlaps, but it's easy to generalize. Is there anything below that is suboptimal or could be done better? Is anything likely to break because it relies on low-level information that could change (e.g., the block key)?
def my_overlap(data, chunk_xy, depth_xy):
    data = data.rechunk((-1, -1, -1, chunk_xy, chunk_xy))
    data_overlapping_chunks = da.overlap.overlap(data,
                                                 depth=(0, 0, 0, depth_xy, depth_xy),
                                                 boundary={3: 'reflect', 4: 'reflect'})
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features)(block, offsets=offsets, depth_xy=depth_xy)
        dfs.append(df_block)
    # All computation is delayed, so downstream computations need to know the format
    # of the data. If the meta information is not specified, a single computation will
    # be done (which could be expensive) at this point to infer the metadata.
    # This empty dataframe has the index, column, and type information we expect
    # in the computation.
    columns = ['r', 'c', 'z', 'y', 'x']
    # The dtypes are float64, except for a small number of columns
    df_meta = pd.DataFrame(columns=columns, dtype=np.float64)
    df_meta = df_meta.astype({'c': np.int64, 'r': np.int64})
    df_meta.index.name = 'feature'
    return dd.from_delayed(dfs, meta=df_meta)

def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y + offsets[3] - depth_xy,
                       'x': x + offsets[4] - depth_xy})
    df = df[(df.y > depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.z > depth_xy) & (df.z < (chunk.shape[4] - depth_xy))]
    return df

data = np.zeros((2, 4, 8, 16, 16))  # round, channel, z, y, x
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap(arr, chunk_xy=8, depth_xy=2)
df.compute().reset_index()
First of all, thanks for posting your code. I am working on a similar problem and this was really helpful for me.
When testing your code, I discovered a few mistakes in the extract_features function that prevent your code from returning correct indices.
Here is a corrected version:
def extract_features(chunk, offsets, depth_xy):
    r, c, z, y, x = np.nonzero(chunk)
    df = pd.DataFrame({'r': r, 'c': c, 'z': z, 'y': y, 'x': x})
    df = df[(df.y >= depth_xy) & (df.y < (chunk.shape[3] - depth_xy)) &
            (df.x >= depth_xy) & (df.x < (chunk.shape[4] - depth_xy))]
    df['y'] = df['y'] + offsets[3] - depth_xy
    df['x'] = df['x'] + offsets[4] - depth_xy
    return df
The updated code now returns the indices that were set to 1:
   index  r  c  z  y  x
0      0  0  0  4  2  2
1      1  0  1  4  6  2
2      2  0  3  4  2  2
3      1  1  2  4  8  2
For comparison, this is the output of the original version:
   index  r  c  z  y  x
0      1  0  1  4  6  2
1      3  1  2  4  8  2
2      0  0  1  4  6  2
3      1  1  2  4  8  2
That is, it returns only the second and fourth rows of the corrected output above, the points (0, 1, 4, 6, 2) and (1, 2, 4, 8, 2), twice each, and misses the other two points.
The reason why this happens is three mistakes in the extract_features function:
You first add the offset and subtract the depth, and only then filter out the overlapping parts: the order needs to be swapped (see the short demo after this list)
df.y > depth_xy should be replaced with df.y >= depth_xy
df.z should be replaced with df.x, since it is the x dimension that has an overlap
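To make the first mistake concrete, here is a tiny standalone sketch (plain NumPy; the numbers are made up for illustration, not taken from the post):
import numpy as np

y = np.array([1, 5, 12])  # y-coordinates inside a chunk of height 16
offset, depth = 16, 2     # chunk offset into the full array, and overlap depth

# Wrong order: shift to global coordinates first, then filter --
# the bounds no longer refer to positions inside the chunk
y_shifted = y + offset - depth
print(y_shifted[(y_shifted >= depth) & (y_shifted < 16 - depth)])  # [] -- points lost

# Right order: filter in chunk-local coordinates, then shift the survivors
keep = (y >= depth) & (y < 16 - depth)
print(y[keep] + offset - depth)  # [19 26]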
Going one step further, here is a generalized version of the code that works for an arbitrary number of dimensions:
def extract_features_generalized(chunk, offsets, depth, columns):
    coordinates = np.nonzero(chunk)
    df = pd.DataFrame()
    rows_to_keep = np.ones(len(coordinates[0]), dtype=int)
    for i in range(len(columns)):
        df[columns[i]] = coordinates[i]
        rows_to_keep = rows_to_keep * np.array((df[columns[i]] >= depth[i])) * \
                       np.array((df[columns[i]] < (chunk.shape[i] - depth[i])))
        df[columns[i]] = df[columns[i]] + offsets[i] - depth[i]
    del coordinates
    return df[rows_to_keep > 0]

def my_overlap_generalized(data, chunksize, depth, columns):
    data = data.rechunk(chunksize)
    data_overlapping_chunks = da.overlap.overlap(data, depth=depth,
                                                 boundary=tuple(['reflect'] * len(columns)))
    dfs = []
    for block in data_overlapping_chunks.to_delayed().ravel():
        offsets = np.array(block.key[1:]) * np.array(data.chunksize)
        df_block = dask.delayed(extract_features_generalized)(block, offsets=offsets,
                                                              depth=depth, columns=columns)
        dfs.append(df_block)
    return dd.from_delayed(dfs)

data = np.zeros((2, 4, 8, 16, 16))
data[0, 0, 4, 2, 2] = 1
data[0, 1, 4, 6, 2] = 1
data[1, 2, 4, 8, 2] = 1
data[0, 3, 4, 2, 2] = 1

arr = da.from_array(data)
df = my_overlap_generalized(arr, chunksize=(-1, -1, -1, 8, 8),
                            depth=(0, 0, 0, 2, 2), columns=['r', 'c', 'z', 'y', 'x'])
df.compute().reset_index()

How to return a count of fields with a given value in a record?

I have a database table with the following fields:
----------------------------
FIELDS : | H1 | H2 | H3 | H4
----------------------------
VALUES : | A  | B  | A  | C
----------------------------
For a given record (row), I would like to count the number of fields with a value of A. In the above example there are two fields with a value of A, so the expected result would be 2.
How can I achieve this?
I am trying to answer the question from a database point of view.
You have a table with one or more rows, and every row has either an 'A' or something else in each of the four columns. For a given row (or for many rows) you want to get the number of columns that have an 'A' in them.
As one commenter pointed out, you can't sum letters, but you can check whether or not a value is the one you are looking for, count each occurrence as a 1 or 0, and finally sum those values and return the sum.
SELECT (CASE H1 WHEN 'A' THEN 1 ELSE 0 END) +
       (CASE H2 WHEN 'A' THEN 1 ELSE 0 END) +
       (CASE H3 WHEN 'A' THEN 1 ELSE 0 END) +
       (CASE H4 WHEN 'A' THEN 1 ELSE 0 END) AS number_of_a
FROM name_of_your_table;
For your example row this will return:
NUMBER_OF_A
===========
2
If you have more than one row you'll get the number of As for every row.
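If you want to try this out without setting up a database server, here is a minimal, self-contained check of the query using Python's built-in sqlite3 module (the table and column names are taken from the example above):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE name_of_your_table (H1 TEXT, H2 TEXT, H3 TEXT, H4 TEXT)")
conn.execute("INSERT INTO name_of_your_table VALUES ('A', 'B', 'A', 'C')")
row = conn.execute("""
    SELECT (CASE H1 WHEN 'A' THEN 1 ELSE 0 END) +
           (CASE H2 WHEN 'A' THEN 1 ELSE 0 END) +
           (CASE H3 WHEN 'A' THEN 1 ELSE 0 END) +
           (CASE H4 WHEN 'A' THEN 1 ELSE 0 END) AS number_of_a
    FROM name_of_your_table
""").fetchone()
print(row[0])  # prints 2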
I tested this and it works, thanks for the help. Here is another solution:
SELECT count(H1) + count(H2) + count(H3) + count(H4) + count(H5) +
       count(H6) + count(H7) + count(H8) AS TOT
FROM Table T
WHERE T.H1 = 'A' OR T.H2 = 'A' OR T.H3 = 'A' OR T.H4 = 'A'
   OR T.H5 = 'A' OR T.H6 = 'A' OR T.H7 = 'A' OR T.H8 = 'A'
GROUP BY T.ID
ORDER BY 1 DESC

How to get value from the table in Lua

I have a table with multiple values and I want to print each of them.
The output should be like:
'value_1' 'value_2' etc.
table = {
    {'value_1'},
    {'value_2'},
    {'value_3'},
    {'value_4'},
}
I tried a for k, v loop, but it failed:
for k, v in pairs(table) do
    print(v)
end
The values of your table are tables themselves. So try this instead:
for k, v in pairs(table) do
    print(v[1])
end
Or create a simpler table and use your original code:
table = {
    'value_1',
    'value_2',
    'value_3',
    'value_4',
}
I'm unsure whether your example was meant to be production code or not, but there are a few (admittedly small) optimizations you could make:
- Make the table a local variable, e.g. local table = {};
- Remove the unnecessary nested tables, e.g. {'value1'}; becomes 'value1';
- Change the k, v loop to a numeric for loop (I believe that would be more efficient?)
Final code (as I would put it):
local Table = {
    "value_1";
    "value_2";
    "value_3";
    "value_4";
};

for Key = 1, #Table, 1 do
    print(Table[Key]);
end;
Feel free to ask any questions. Oh, and if you're planning on running this code many times, consider putting local print = print; above your code to define a local alias (local variables are faster to access than globals).
You are working with multidimensional arrays when you have sub-tables. You can index a sub-table as shown below.
local tab = {
    {1, 2, 3},
    {4, 5, 6},
    {7, 8, 9}
}

for i, v in next, tab do
    print(i, v)
    for n, k in next, v do
        print(">", n, k)
    end
end

-- 1 table: 000001
-- > 1 1
-- > 2 2
-- > 3 3
-- 2 table: 000002
-- > 1 4
-- > 2 5
-- > 3 6
-- 3 table: 000003
-- > 1 7
-- > 2 8
-- > 3 9
To index the table above without for loops, you can use square brackets:
print(tab[1][1]) --> 1
print(tab[1][2]) --> 2
print(tab[2][1]) --> 4
print(tab[2][2]) --> 5
You are NOT restricted to number indices. You can also use strings as keys, and there is special syntax for indexing with them:
local tab = {
    x = 5,
    y = 10,
    [3] = 15
}
print(tab.x, tab["y"], tab[3]) --> 5 10 15

Lua 3 point crossover help to start

I want to implement a 3-point crossover for genetic programming, but I don't know how to do it or where to start.
My input is:
a = {(first pair), (second pair), ... etc.}
For example a = {(12345,67890), (09876,54321)} (those are numbers, not strings)
Output: something like this:
a_1 = {(12895), (67340)}, also numbers.
Thanks for any reply, and sorry for my bad English.
Here is my quick implementation of k-point crossover for integers, using mostly integer arithmetic. Starting from this, you can extend it to cross over chromosomes made of many pairs of integers by looping over the pairs.
math.randomseed(111)
-- math.randomseed(os.time())

a = 12345
b = 67890
len = 5 -- number of digits

-- Single-point crossover: pick a random digit position, then swap
-- everything below that position between the two parents.
function split(mum, dad, len, base)
    local split = math.floor(base ^ math.random(len)) -- math.pow was removed in Lua 5.4
    local son = math.floor(dad / split) * split + mum % split
    local daughter = math.floor(mum / split) * split + dad % split
    return son, daughter
end

-- k-point crossover is just k successive single-point crossovers
function kpoint(mum, dad, len, base, k)
    for i = 1, k do
        mum, dad = split(mum, dad, len, base)
    end
    return mum, dad
end

s, d = kpoint(a, b, len, 10, 3) -- 3-point crossover in base 10
print(s) -- 67395
print(d) -- 12840

-- binary (crossover of the binary representation)
s, d = kpoint(tonumber("10001", 2), tonumber("10110", 2), 5, 2, 3)
print(s) -- 23, which is (10111) in base 2
print(d) -- 16, which is (10000) in base 2

-- binary-looking decimals (crossover in base 10, but interpret as binary)
s, d = kpoint(1101, 1010, 4, 10, 3)
print(s) -- 1001
print(d) -- 1110
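To apply this to your input format, the loop mentioned above would call kpoint once per pair: for each pair (mum, dad) in a, collect the two numbers returned by kpoint(mum, dad, len, 10, 3) into the output table.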

Ruby on Rails - Calculate Size of Number Range

Forgive my lack of code but I can't quite figure out the best way to achieve the following:
I have two strings (stored as strings because of the leading 0; they are phone numbers):
a = '0123456700'
b = '0123456750'
I am trying to find a way to write them as a range, as follows:
0123456700 - 750
rather than
0123456700 - 0123456750
which I currently have.
It's not as straightforward as taking the last 3 digits of b, since the differing part can vary in length and perhaps go up to 4 digits, so I'm trying to find the best general way to do this.
I'd look up the index of the first unequal pair of characters:
a = '0123456700'
b = '0123456750'
index = a.chars.zip(b.chars).index { |x, y| x != y }
#=> 8
And extract the suffix with:
"#{a} - #{b[index..-1]}" if index
#=> "0123456700 - 50"
Here's a method that returns the range:
def my_range(a, b)
  a = a.delete(" ") # remove all spaces from the string
  b = b.delete(" ")
  a, b = b, a if a.to_i > b.to_i # make sure a is the smaller number
  ai, bi = a.to_i, b.to_i
  pow = 1
  while ai > 1
    pow += 1
    len = pow if ai % 10 != bi % 10
    ai /= 10
    bi /= 10
  end
  a + " - " + b[-len..-1]
end
puts my_range("0123456700", "0123456750") # 0123456700 - 750
puts my_range("0123456669", "0123456675") # 0123456669 - 675
puts my_range("0123400200", "0123500200") # 0123400200 - 3500200
puts my_range("012 345 678", "01 235 0521") # 012345678 - 350521
From my personal library (simplified):
def common_prefix first, second
  i = 0
  loop { break unless first[i] and second[i] == first[i]; i += 1 }
  first[0, i]
end
a = "0123456700"
b = "0123456750"
c = "0123457750"
common_prefix(a, b)
# => "01234567"
"#{a} - #{b.sub(common_prefix(a, b), "")}"
# => "0123456700 - 50"
"#{a} - #{c.sub(common_prefix(a, c), "")}"
# => "0123456700 - 7750"
Note: this will only work correctly under the assumption that all strings are right-padded with 0 to the same length.
