Parsing using awk

How do I parse a file based on data from another file using awk? I made a script:
BEGIN {
    FS="\t"; OFS="\t"
    while ((getline < "headfpkm") > 0) {
        ++a
        id[a]=$1
        fpkm[a]=$2
        print id[a],fpkm[a]
    }
    lastid=id[a]
    print lastid
    close("headfpkm")
}
/$lastid/{
    print $2,$3,$5,$7,$8,$14,fpkm[a]
    a--
    lastid=id[a]
}
END{ print "total lines=",FNR,"\n\nfile 1 index: ",a }
When I run it:
$ awk -f testawk.awk file2
it runs the BEGIN section properly but doesn't give any output from the main block:
NM_000014 5.04503
NM_000015 0.586677
NM_000016 1.138332278
NM_000017 0.64386
NM_000018 3.61746
NM_000019 2.8793
NM_000020 10.846
NM_000021 0.685098
NM_000022 46388.6
NM_000026 0.257471
NM_000026
total lines= 10
file 1 index: 10
Is there anything wrong with the searching section?
File 2 looks like this:
34 ACADM NM_000016 9606 hsa-miR-3148 3 80 87 0.003 -0.016 -0.094 0.082 0.112 -0.160 97
34 ACADM NM_000016 9606 hsa-miR-3163 1 623 629 0.001 -0.022 -0.020 0.065 0.125 -0.01 57
35 ACADS NM_000017 9606 hsa-miR-3921 3 68 75 0.013 0.192 -0.097 0.031 -0.039 -0.147 82
35 ACADS NM_000017 9606 hsa-miR-4303 2 67 73 0.012 0.150 -0.052 0.013 -0.039 -0.036 31
35 ACADS NM_000017 9606 hsa-miR-4653-5p 3 68 75 0.003 0.192 -0.097 0.031 -0.039 -0.157 84
37 ACADVL NM_000018 9606 hsa-miR-124 2 31 37 0.003 0.023 -0.057 0.012 -0.032 -0.171 76
37 ACADVL NM_000018 9606 hsa-miR-1827 2 135 141 -0.007 -0.043 -0.058 0.039 -0.069 -0.258 91
37 ACADVL NM_000018 9606 hsa-miR-2682 2 134 140 0.003 -0.014 -0.058 0.004 -0.047 -0.232 87
37 ACADVL NM_000018 9606 hsa-miR-449c 2 134 140 -0.035 -0.014 -0.058 0.004 -0.047 -0.270 92
37 ACADVL NM_000018 9606 hsa-miR-506 2 31 37 -0.016 0.023 -0.057 0.012 -0.032 -0.190 80

This is going to be a bit of a guess, because I'm not 100% sure what you're trying to accomplish. A better way to solve your problem would be to do something like this:
BEGIN {
    FS=OFS="\t"
}
FNR==NR {
    c++
    a[$1]=$2
    next
}
$3 in a {
    print $2,$3,$5,$7,$8,$14,a[$3]
}
END {
    printf "total lines=%s\n\nfile 1 index: %s\n", FNR, c
}
Run like:
awk -f script.awk headfpkm file2
Results:
ACADM NM_000016 hsa-miR-3148 80 87 -0.160 1.138332278
ACADM NM_000016 hsa-miR-3163 623 629 -0.01 1.138332278
ACADS NM_000017 hsa-miR-3921 68 75 -0.147 0.64386
ACADS NM_000017 hsa-miR-4303 67 73 -0.036 0.64386
ACADS NM_000017 hsa-miR-4653-5p 68 75 -0.157 0.64386
ACADVL NM_000018 hsa-miR-124 31 37 -0.171 3.61746
ACADVL NM_000018 hsa-miR-1827 135 141 -0.258 3.61746
ACADVL NM_000018 hsa-miR-2682 134 140 -0.232 3.61746
ACADVL NM_000018 hsa-miR-449c 134 140 -0.270 3.61746
ACADVL NM_000018 hsa-miR-506 31 37 -0.190 3.61746
total lines=10
file 1 index: 10
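The key idiom here is FNR==NR: FNR is the record number within the current file, while NR is the overall record number, so that block only runs while the first file (headfpkm) is being read, and it builds the lookup array a keyed by transcript ID. If it helps to see the same two-file join spelled out step by step, here is a rough Python sketch of it (not part of the original answer; it assumes both files are tab-separated, as in the question):

import csv

# Pass 1: build the lookup table id -> fpkm from the first file.
fpkm = {}
with open("headfpkm") as f:
    for row in csv.reader(f, delimiter="\t"):
        if row:
            fpkm[row[0]] = row[1]

# Pass 2: print selected fields from rows of the second file whose
# third column is a known ID, appending the matching fpkm value.
with open("file2") as f:
    for row in csv.reader(f, delimiter="\t"):
        if len(row) >= 14 and row[2] in fpkm:
            print("\t".join([row[1], row[2], row[4], row[6], row[7], row[13], fpkm[row[2]]]))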

Related

Calculating the most frequent pairs in a dataset

Is it possible to calculate the most frequent pairs from the combinations of pairs in a dataset with five columns?
I can do this with a macro in Excel; I'd be curious to see if there's a simple solution for this in Google Sheets.
I have a sample data and results page here:
Data:
B1 B2 B3 B4 B5
6 22 28 32 36
7 10 17 31 35
8 33 38 40 42
10 17 36 40 41
8 10 17 36 54
9 30 32 51 55
1 4 16 26 35
12 28 30 40 43
42 45 47 49 52
10 17 30 31 47
10 17 33 51 58
4 10 17 30 32
2 35 36 37 43
6 10 17 38 55
3 10 17 25 32
Results would be like:
Value1 Value2 Frequency
10 17 8
10 31 2
17 31 2
10 36 2
17 36 2
30 32 2
10 30 2
17 30 2
10 32 2
17 32 2
etc
Each row represents a data set. The pairs don't have to be adjacent; there can be other numbers between them.
Create the combinations of pairs for each row using the method mentioned here. Then REDUCE all the pairs into a single virtual 2D array, and use QUERY to group them and find the counts:
=QUERY(
  REDUCE(
    {"",""},
    A2:A16,
    LAMBDA(acc,cur,
      {
        acc;
        QUERY(
          LAMBDA(mrg,
            REDUCE(
              {"",""},
              SEQUENCE(COLUMNS(mrg)-1,1,0),
              LAMBDA(a_,c_,
                {
                  a_;
                  LAMBDA(rg,
                    REDUCE(
                      {"",""},
                      OFFSET(rg,0,1,1,COLUMNS(rg)-1),
                      LAMBDA(a,c,{a;{INDEX(rg,1),c}})
                    )
                  )(OFFSET(mrg,0,c_,1,COLUMNS(mrg)-c_))
                }
              )
            )
          )(OFFSET(cur,0,0,1,5)),
          "where Col1 is not null",0
        )
      }
    )
  ),
  "Select Col1,Col2, count(Col1) group by Col1,Col2 order by count(Col1) desc "
)
Input:
B1(A1)  B2  B3  B4  B5
6       22  28  32  36
7       10  17  31  35
8       33  38  40  42
10      17  36  40  41
8       10  17  36  54
9       30  32  51  55
1       4   16  26  35
12      28  30  40  43
42      45  47  49  52
10      17  30  31  47
10      17  33  51  58
4       10  17  30  32
2       35  36  37  43
6       10  17  38  55
3       10  17  25  32
Output (partial):
        count
10  17  8
10  30  2
10  31  2
10  32  2
10  36  2
17  30  2
17  31  2
17  32  2
17  36  2
30  32  2
1   4   1
1   16  1
1   26  1
1   35  1
2   35  1
2   36  1
2   37  1
2   43  1
3   10  1
3   17  1
3   25  1
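If you want to sanity-check the counts outside of Sheets, the underlying idea is just "enumerate every unordered pair within each row, then tally them". Here is a rough Python sketch of that (illustrative only, not part of the original answer), using the sample data above:

from itertools import combinations
from collections import Counter

rows = [
    [6, 22, 28, 32, 36],
    [7, 10, 17, 31, 35],
    [8, 33, 38, 40, 42],
    [10, 17, 36, 40, 41],
    [8, 10, 17, 36, 54],
    [9, 30, 32, 51, 55],
    [1, 4, 16, 26, 35],
    [12, 28, 30, 40, 43],
    [42, 45, 47, 49, 52],
    [10, 17, 30, 31, 47],
    [10, 17, 33, 51, 58],
    [4, 10, 17, 30, 32],
    [2, 35, 36, 37, 43],
    [6, 10, 17, 38, 55],
    [3, 10, 17, 25, 32],
]

# Every unordered pair within a row counts once; the pair members need not be adjacent.
pair_counts = Counter(pair for row in rows for pair in combinations(sorted(row), 2))

for (a, b), n in pair_counts.most_common(10):
    print(a, b, n)   # prints "10 17 8" first, matching the sheet output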

Export data from MongoDB using mongoexport for a specific time range with query and timestamp

I'm trying to export a collection from MongoDB using mongoexport. This works so far:
mongoexport.exe --db dataloggin --collection p1 --out myRecords.json
The problem is that the file is huge and I cannot open it anymore (around 20 GB: 30 days of data, one document every half second).
I just need the data between March 3rd and March 5th, so I tried to select this date range with the query selector as follows:
mongoexport.exe --db dataloggin --collection p1 -q='{"timestamp":{"$gte":{"$timestamp":"2016-03-3T00:00:00.000Z"}:},"timestamp":{"$lt":{"$timestamp":"2016-03-05T00:00:00.000Z"}}}' --out myRecords.json
But I get an error:
error validating settings: query '[39 123 116 105 109 101 115 116 97 109 112 58 123 36 103 116 101 58 123 36 116 105 109 101 115 116 97 109 112 58 50 48 49 54 45 48 49 45 48 49 84 48 48 58 48 48 58 48 48 46 48 48 48 90 125 58 125 44 116 105 109 101 115 116 97 109 112 58 123 36 108 116 58 123 36 116 105 109 101 115 116 97 109 112 58 50 48 49 54 45 48 49 45 48 49 84 48 48 58 48 48 58 48 48 46 48 48 48 90 125 125 125 39]' is not valid JSON: json: cannot unmarshal string into Go value of type map[string]interface {}
Does someone have an idea?
Many thanks and regards
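For comparison, the same date-range filter is straightforward to express in pymongo, where the query is a plain Python dict and no shell quoting or JSON escaping is involved. This is a rough, untested sketch (it assumes the timestamp field holds BSON dates rather than BSON Timestamp values, and a server on localhost):

from datetime import datetime
from pymongo import MongoClient
from bson.json_util import dumps

client = MongoClient()                     # defaults to localhost:27017
coll = client["dataloggin"]["p1"]

# Select documents with 2016-03-03 <= timestamp < 2016-03-05.
query = {"timestamp": {"$gte": datetime(2016, 3, 3),
                       "$lt": datetime(2016, 3, 5)}}

with open("myRecords.json", "w") as out:
    for doc in coll.find(query):
        out.write(dumps(doc) + "\n")       # Extended JSON, one document per line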

How to overcome this error when using NetworkX's kernighan_lin_bisection

I want to use kernighan_lin_bisection from NetworkX to partition a network dataset, but the error below showed up and I'm stuck.
It would be highly appreciated if you could help me overcome this error.
IndexError                                Traceback (most recent call last)
in ()
     17 for c in init_partition:
     18     for n in c:
---> 19         color_map_i[n]=colors[counter]
     20     counter=counter+1
     21
IndexError: list assignment index out of range
The code I used and the data source "200224_04_act.prn" are below.
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt

G=nx.read_edgelist("200224_04_act.prn",nodetype=int)
colors=["red","blue","green"]
pos=nx.spring_layout(G)
init_nodes=np.array_split(G.nodes(),2)
init_partition=[set(init_nodes[0]),set(init_nodes[1])]
print(init_partition)
from networkx.algorithms.community import kernighan_lin_bisection
color_map_i=["black"]*nx.number_of_nodes(G)
print(color_map_i)
counter=0
for c in init_partition:
    for n in c:
        color_map_i[n]=colors[counter]
    counter=counter+1
print(color_map_i)
nx.draw_networkx_edges(G,pos)
nx.draw_networkx_nodes(G,pos,node_color=color_map_i)
nx.draw_networkx_labels(G,pos)
plt.axis("off")
plt.show()
lst_b=kernighan_lin_bisection(G,partition=init_partition)
color_map_b=["black"]*nx.number_of_nodes(G)
counter=0
for c in lst_b:
    for n in c:
        color_map_b[n]=colors[counter]
    counter=counter+1
nx.draw_networkx_edges(G,pos)
nx.draw_networkx_nodes(G,pos,node_color=color_map_b)
nx.draw_networkx_labels(G,pos)
plt.axis("off")
plt.show()
"200224_04_act.prn" below.(Number of nodes is around 2000 but I made it
small due to the limit of number of character)
1 415
2 415
3 415
3 1350
4 1351
5 1352
6 383
7 993
8 1353
9 887
10 887
11 887
12 887
13 887
14 1185
15 1185
16 1185
17 1185
18 1185
19 1146
20 1146
21 1146
22 1146
21 776
23 776
24 707
25 707
26 707
27 707
28 707
29 754
21 754
30 754
31 754
32 754
33 778
34 778
35 778
36 778
37 778
38 859
39 859
40 1354
41 563
42 563
43 563
44 563
45 563
46 1209
47 1209
48 1209
49 1209
50 1209
51 715
52 715
53 715
54 715
55 715
56 1048
57 1048
58 1047
59 1047
60 1047
61 1047
62 1047
63 718
64 718
65 718
66 718
67 718
68 947
17 947
69 947
70 889
71 744
72 744
73 744
74 744
75 744
76 1137
77 1137
78 1137
79 1137
80 612
81 612
82 612
83 612
17 612
84 790
85 790
86 790
87 790
88 790
89 922
90 922
91 922
92 922
93 922
21 738
94 738
95 738
96 738
97 738
98 1355
81 807
99 807
17 807
100 725
101 725
17 725
102 725
103 725
23 1046
104 661
105 661
106 661
107 661
108 661
109 907
110 907
111 907
112 907
113 907
114 840
115 840
116 840
117 840
17 840
118 759
23 759
119 759
23 761
120 761
121 761
122 761
123 1356
124 1265
125 1265
126 1265
127 1265
128 1265
129 894
29 894
130 894
131 894
132 667
133 667
124 758
134 758
135 758
122 758
136 758
137 471
138 471
You've got
for c in init_partition:
    for n in c:
        color_map_i[n]=colors[counter]
    counter=counter+1
It looks to me like n will loop over all of the nodes of the graph. I do not see any entries in the graph that are 0. So probably the nodes are numbered 1 to N, while color_map_i is indexed from 0 to N-1. So it would break when n=N.
A good way to hunt for bugs like this in general is to print n right before the line that gives the error. That would give a hint as to what the problem is.
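One way to sidestep the indexing assumption entirely is to key the colour map by node label instead of by list position. This is a sketch of that idea, not a drop-in replacement for the code above (for brevity it lets kernighan_lin_bisection pick its own initial partition):

import networkx as nx
import matplotlib.pyplot as plt
from networkx.algorithms.community import kernighan_lin_bisection

G = nx.read_edgelist("200224_04_act.prn", nodetype=int)
colors = ["red", "blue"]

# Build a dict node -> colour, so node labels can be anything (1..N, strings, ...).
partition = kernighan_lin_bisection(G)
color_by_node = {}
for i, community in enumerate(partition):
    for n in community:
        color_by_node[n] = colors[i]

pos = nx.spring_layout(G)
nx.draw_networkx(G, pos, node_color=[color_by_node[n] for n in G.nodes()])
plt.axis("off")
plt.show()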

imagemagick - find coordinates of outline of transparent png (not border)

While it's easy enough to outline it visually, is it possible to have ImageMagick output the coordinates of the outline of a transparent image?
Note, by outline, I don't mean just the bounding box border, but the actual contour around an arbitrarily shaped transparent image geometry.
Let's say you start with this image, which has a transparent background:
You can extract the transparency and find the edges like this:
convert penguin.png -alpha extract -edge 1 -threshold 50% edges.png
If, rather than an image, you want a list of the coordinates of the contour (i.e. the white pixels), you could do this instead:
convert penguin.png -alpha extract -edge 1 -threshold 50% -depth 8 txt: | awk -F: '/white/{print $1}'
256,0
253,1
254,1
255,1
257,1
258,1
259,1
253,2
259,2
252,3
253,3
...
...
Replace awk and everything after it with more to see what the awk is actually doing - it is just printing the coordinates of every pixel that is white.
The above pixels come out in row order, not like a contour where adjacent pixels come out together. If you want that, you might prefer to generate an SVG of the transparency with potrace like this:
convert penguin.png -alpha extract -threshold 50% pgm:- | potrace - --svg -o result.svg
Output
<?xml version="1.0" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 20010904//EN"
"http://www.w3.org/TR/2001/REC-SVG-20010904/DTD/svg10.dtd">
<svg version="1.0" xmlns="http://www.w3.org/2000/svg"
width="500.000000pt" height="577.000000pt" viewBox="0 0 500.000000 577.000000"
preserveAspectRatio="xMidYMid meet">
<metadata>
Created by potrace 1.13, written by Peter Selinger 2001-2015
</metadata>
<g transform="translate(0.000000,577.000000) scale(0.100000,-0.100000)"
fill="#000000" stroke="none">
<path d="M0 3294 l0 -2476 25 16 c31 20 104 21 152 1 20 -8 38 -14 40 -12 2 2
-8 35 -23 73 -49 129 -29 246 43 260 51 10 109 -12 159 -58 43 -40 44 -40 44
-15 0 41 60 120 99 130 28 7 32 11 26 30 -24 70 -15 385 16 637 44 346 171
715 351 1023 34 58 35 77 5 77 -43 0 -242 47 -277 65 -68 35 -196 130 -248
185 -83 86 -119 154 -153 286 -31 124 -38 237 -18 307 13 49 15 49 91 -15 177
-151 388 -309 477 -357 42 -23 108 -60 146 -81 76 -43 194 -83 228 -78 19 3
23 12 28 60 4 32 12 65 19 73 15 18 16 17 -35 34 -88 29 -255 122 -255 141 0
5 12 15 27 22 24 11 48 10 168 -9 77 -12 179 -22 226 -23 80 0 87 2 111 28 60
64 64 105 11 123 -21 6 -76 30 -122 51 l-84 39 -30 -21 c-21 -16 -32 -19 -40
-11 -19 19 8 54 63 84 46 25 59 27 109 22 31 -3 66 -8 79 -11 77 -16 -65 130
-178 182 -84 38 -150 45 -340 35 -225 -12 -242 -15 -322 -55 -57 -28 -68 -31
-68 -17 0 26 77 95 150 134 206 112 479 177 741 177 l97 0 -20 38 c-37 70 -10
186 68 294 54 75 56 88 23 119 -16 15 -29 36 -29 47 0 12 -4 24 -10 27 -19 12
-10 57 18 94 69 89 125 236 167 431 24 113 41 148 68 137 19 -7 43 -83 52
-162 l7 -60 33 90 c18 50 36 110 40 135 10 61 55 111 55 61 0 -11 9 -35 20
-54 29 -49 40 -235 17 -289 -9 -21 -12 -38 -8 -38 4 0 34 22 65 50 31 27 66
55 78 61 20 11 21 9 14 -22 -3 -18 -24 -72 -46 -119 -22 -48 -40 -90 -40 -94
0 -14 43 17 110 80 143 136 324 456 306 545 -4 21 -1 31 12 38 11 7 -402 10
-1260 11 l-1278 0 0 -2476z"/>
<path d="M2613 5669 c17 -82 18 -107 9 -133 -7 -18 -12 -47 -12 -64 0 -18 -14
-60 -31 -93 -17 -34 -28 -64 -25 -67 10 -10 49 17 129 88 43 39 84 70 92 70
23 0 17 -31 -15 -79 -17 -24 -30 -49 -30 -56 0 -6 -32 -58 -70 -115 -39 -56
-69 -104 -67 -106 4 -5 111 33 153 54 33 17 42 18 57 7 9 -7 17 -17 17 -23 0
-16 -104 -125 -134 -141 -14 -8 -26 -17 -26 -20 0 -4 23 -17 50 -30 53 -25
100 -63 100 -80 0 -19 -36 -30 -116 -36 -43 -3 -103 -8 -133 -11 l-54 -6 82
-80 c92 -90 185 -225 230 -331 34 -83 71 -217 71 -260 0 -15 5 -38 12 -50 18
-34 31 -169 29 -311 -1 -97 2 -133 12 -145 8 -9 29 -43 47 -76 19 -33 58 -96
87 -139 64 -93 107 -175 122 -233 10 -36 17 -44 49 -54 59 -18 220 -22 467 -9
127 6 316 15 420 20 184 9 267 15 311 25 46 10 364 35 451 35 64 0 93 -4 96
-12 3 -7 6 566 6 1275 l1 1287 -1204 0 -1203 0 20 -101z"/>
<path d="M4971 3138 c-20 -33 -89 -96 -165 -153 -23 -17 -103 -92 -179 -168
-77 -75 -145 -137 -153 -137 -8 0 -27 -21 -43 -47 -29 -47 -30 -47 -178 -99
-81 -29 -158 -53 -171 -53 -12 -1 -61 -15 -109 -31 -50 -18 -106 -30 -133 -30
-25 1 -101 -2 -170 -6 -100 -5 -136 -3 -179 10 -30 9 -70 16 -89 16 -33 0 -33
-1 -25 -32 4 -18 8 -92 8 -164 l0 -132 32 -20 c17 -12 70 -64 117 -117 68 -76
96 -119 140 -208 58 -118 106 -256 106 -302 0 -24 3 -26 23 -20 12 3 42 9 67
12 69 8 83 -18 98 -180 16 -176 45 -309 96 -446 24 -64 48 -129 54 -146 17
-50 52 -247 52 -295 0 -69 -42 -141 -129 -221 -99 -91 -141 -111 -239 -112
-91 -2 -129 14 -157 67 -23 42 -35 44 -76 14 -35 -26 -138 -36 -191 -19 -60
20 -78 45 -78 106 0 47 4 57 24 70 41 27 148 62 227 74 41 6 89 14 107 17 19
3 32 11 32 20 0 8 12 42 26 75 24 58 25 66 15 145 -7 57 -25 120 -57 204 -26
66 -50 120 -53 120 -16 -1 -49 -42 -73 -90 -56 -110 -125 -182 -277 -287 -66
-46 -158 -73 -248 -73 -82 0 -93 -4 -173 -65 -85 -65 -188 -116 -290 -143
-139 -38 -226 -46 -470 -45 -201 1 -250 4 -339 23 -57 12 -116 20 -130 17 -14
-3 -73 -22 -131 -42 -97 -34 -111 -37 -192 -33 l-87 4 -6 -40 c-4 -23 -9 -55
-12 -73 -7 -43 -51 -82 -104 -92 -25 -5 786 -9 1934 -10 l1977 -1 0 1585 c0
872 -2 1585 -4 1585 -2 0 -14 -15 -25 -32z"/>
<path d="M0 358 l0 -358 458 1 c393 1 450 2 407 14 -146 37 -255 105 -435 270
-30 28 -102 83 -160 122 -173 118 -218 166 -256 271 -11 32 -13 -14 -14 -320z"/>
</g>
</svg>
Assuming you have a transparent "input.png", first convert all nontransparent pixels to white, then use the "-edge" option to find the transitions between transparent and white:
convert input.png -negate -threshold 1 -edge 1 edge.png
Note that this will not only outline the image but will outline any "holes" in it as well. For example, try it with the built-in "logo:" image:
convert logo: -transparent white logotrans.png
convert logotrans.png -negate -threshold 1 -edge 1 t.png
which transforms the opaque logo into an image showing just the outline of its shape (and of the holes within it).
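If you would rather do this programmatically than in the shell, here is a rough Python sketch of the same idea using Pillow and NumPy (not from the original answers; it assumes the PNG has an alpha channel). An opaque pixel is treated as part of the contour if at least one of its 4-neighbours is transparent:

import numpy as np
from PIL import Image

# Build a boolean "opaque" mask from the alpha channel.
alpha = np.array(Image.open("penguin.png").convert("RGBA"))[:, :, 3]
opaque = alpha > 0

# Pad with a transparent border so pixels on the image edge count as contour too.
padded = np.pad(opaque, 1, mode="constant", constant_values=False)

# A pixel is on the outline if it is opaque and any 4-neighbour is transparent.
has_transparent_neighbour = (
    ~padded[:-2, 1:-1] | ~padded[2:, 1:-1] | ~padded[1:-1, :-2] | ~padded[1:-1, 2:]
)
outline = opaque & has_transparent_neighbour

# Print x,y coordinates in the same order as the ImageMagick txt: output above.
for y, x in zip(*np.nonzero(outline)):
    print(f"{x},{y}")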

Convert PNG images to pixel gray-level value feature vectors

I am a newbie in MATLAB and I have a set of bmp images which I need to convert into pixel gray-level values to use as feature vectors. Can anyone suggest how I can do that?
I need to use these pixel gray-level values as features and then perform operations like PCA/LDA.
I tried imread(), but it returns a matrix. I feel the feature vector should be just one row vector.
Regards,
imread() is the correct way to do it. Then just convert from a matrix into a vector. For example:
>> X = randi(255, 10)
X =
208 41 168 181 112 71 192 215 90 20
231 248 10 9 98 174 66 65 212 14
33 245 217 71 196 168 130 208 150 136
233 124 239 12 203 42 179 63 141 199
162 205 174 25 48 31 228 237 234 239
25 37 194 210 125 128 245 90 73 34
72 108 190 178 114 245 140 51 194 146
140 234 101 81 165 87 36 65 193 120
245 203 168 243 181 150 39 158 98 4
247 245 44 9 193 58 66 121 145 86
>> X(:)
ans =
208
231
33
233
162
25
72
140
245
247
...
Then you can just stack your different observations together with [] and do PCA.
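For anyone doing the same preprocessing outside MATLAB, the NumPy analogue of this flatten-and-stack step looks roughly like the sketch below (illustrative only; the filenames are hypothetical, and all images must share the same dimensions):

import numpy as np
from PIL import Image

files = ["img1.bmp", "img2.bmp", "img3.bmp"]   # hypothetical filenames

# One row per image: load as 8-bit grayscale, flatten to a 1-D vector.
features = np.vstack([
    np.asarray(Image.open(f).convert("L"), dtype=np.float64).ravel()
    for f in files
])   # shape: (n_images, n_pixels)

# "features" can now be passed to a PCA/LDA routine, e.g. sklearn.decomposition.PCA.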
