Comparing files based on two columns - join

I have two files with thousands of lines:
file1:
COL22A1 LCT 1 12 0.149667616334 2.16226378401
GPRIN2 TP53 12 170 0.0455368539793 44.2359753827
MUC3A TP53 12 170 0.0455368539793 44.2359753827
file2:
COL22A1 LCT 12 41 23 0.0296296296296 0.101234567901 0.0567901234568 2.36563
MEGF10 SORCS1 10 21 39 0.0246913580247 0.0518518518519 0.0962962962963 2.30599
I want to compare the first two columns of these files and, when they match, print the whole line of the second file plus the last column of the first file:
output:
COL22A1 LCT 12 41 23 0.0296296296296 0.101234567901 0.0567901234568 2.36563 2.16226378401
I tried awk, grep and join, but they always give me output from just one file.

Could you please try the following and let us know how it goes.
awk 'FNR==NR{a[$1,$2]=$NF;next} a[$1,$2]{print $0,a[$1,$2]}' Input_file1 Input_file2
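For readability, here is the same one-liner spread out, with a comment on each part:
awk '
FNR==NR {               # while reading Input_file1
  a[$1,$2] = $NF        # save its last field, keyed on columns 1 and 2
  next                  # skip the second block for Input_file1 lines
}
a[$1,$2] {              # Input_file2 line whose first two columns were seen
  print $0, a[$1,$2]    # print the whole line plus the saved last field
}
' Input_file1 Input_file2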

Related

YouTube Data API returning inconsistent results with duplicates

There have been numerous questions about inconsistent results from the YouTube Data API: 1, 2, 3, 4, 5, 6. Most of them have accepted answers that seem to indicate there was a problem with the API request that was fixed by the instructions in the answers. But none of those situations apply to the API request discussed here.
There have also been two questions about duplicates in the API results: 1, 2. Both of them have the same answer, which says to use the next-page token. But both questions say the token was used, so that answer is not helpful.
Yesterday, I submitted a series of API requests to get the list of most-viewed videos about 3D printing. The first request in the series was:
https://www.googleapis.com/youtube/v3/search?q=3D print&type=video&maxResults=50&part=id,snippet&order=viewCount&key=<my key>
I ran that in a VBA sub, which took the next-page token from each result and resubmitted the URL with &pageToken=<nextPageToken> inserted.
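For reference, here is a minimal sketch of that paging loop in Python rather than VBA (illustrative only; it assumes the requests library, and API_KEY is a placeholder):
import requests

API_KEY = "<my key>"  # placeholder
params = {
    "q": "3D print",
    "type": "video",
    "maxResults": 50,
    "part": "id,snippet",
    "order": "viewCount",
    "key": API_KEY,
}

video_ids = []
while True:
    page = requests.get("https://www.googleapis.com/youtube/v3/search",
                        params=params).json()
    video_ids += [item["id"]["videoId"] for item in page.get("items", [])]
    token = page.get("nextPageToken")
    if not token:
        break
    params["pageToken"] = token  # resubmit with the next-page token

print(len(video_ids), "returned,", len(set(video_ids)), "unique")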
The result was a list of 649 unique video IDs. So far so good.
After making some changes in the VBA code and seeing some duplicates in the result set, I went back today and ran the original VBA sub again. The result was again a list of 649 video IDs, but this time the list included duplicates, contained IDs that were not in yesterday's list, and was missing IDs that were there yesterday. Here is a comparison of the first two pages and the last two pages of the two result sets:
Page  # on page  # overall  Run 1        Run 2        Same as  Seq  Dup
1     1          1          f2mdMcf-fJs  f2mdMcf-fJs  1
1     2          2          WSauz5KVKTU  WSauz5KVKTU  2        Seq
1     3          3          zsSCUWs7k9Q  XYIUM5TkhMo  None
1     4          4          B5Q1J5c8oNc  zsSCUWs7k9Q  3        Seq
1     5          5          cUxIb3Pt-hQ  B5Q1J5c8oNc  4        Seq
1     6          6          4yyOOn7pWnA  LDjE28szwr8  None
1     7          7          3N46jQ0Xi3c  cUxIb3Pt-hQ  5        Seq
1     8          8          08dBVz8_VzU  4yyOOn7pWnA  6        Seq
...
1     13         13         oeKIe1ik2O8  e1rQ8YwNSDs  11       Seq
1     14         14         FrG_eSECfps  RVB2JreIcoc  12       Seq
1     15         15         pPQCwz2q96o  oeKIe1ik2O8  13       Seq
1     16         16         uo3KuoEiu3I  pPQCwz2q96o  15       NOT
1     17         17         0U6aIwd5h9s  uo3KuoEiu3I  16       Seq
...
1     47         47         ShGsW68zbIo  iu9rhqsvrPs  46       Seq
1     48         48         0q0xS7W78KQ  ShGsW68zbIo  47       Seq
1     49         49         UheJQsXOAnY  0q0xS7W78KQ  48       Seq  Dup
1     50         50         H8AcqOh0wis  H8AcqOh0wis  50       NOT  Dup
2     1          51         EWq3-2VuqbQ  0q0xS7W78KQ  48       NOT  Dup
2     2          52         scuTZza4f_o  H8AcqOh0wis  50       NOT  Dup
2     3          53         bJWJW-mz4_U  UheJQsXOAnY  49       NOT
2     4          54         Ii4VYsh9OlM  EWq3-2VuqbQ  51       NOT
2     5          55         r2-OGUu57pU  scuTZza4f_o  52       Seq
2     6          56         8KTnu18Mi9Q  bJWJW-mz4_U  53       Seq
2     7          57         DconsfGsXyA  Ii4VYsh9OlM  54       Seq
2     8          58         UttEvLJP3l8  8KTnu18Mi9Q  56       NOT
2     9          59         GJOOLH9ZP2I  DconsfGsXyA  57       Seq
2     10         60         ewgmg9Q5Ab8  UttEvLJP3l8  58       Seq
...
13    35         635        qHpR_p8lA4I  FFVOzo7tSV8  639      Seq
13    36         636        DplwDDZNTRc  76IBjdM9s6g  640      Seq
13    37         637        3AObqGsimr8  qEh0uZuu7_U  None
13    38         638        88keQ4PWH18  RhfGJduOlrw  641      Seq
13    39         639        FFVOzo7tSV8  QxzH9QkirCU  643      NOT
13    40         640        76IBjdM9s6g  Qsgz4GbL8O4  None
13    41         641        RhfGJduOlrw  BSgg7mEzfqY  644      Seq
13    42         642        lVEqwV0Nlzg  VcmjbJ2q8-w  645      Seq
13    43         643        QxzH9QkirCU  gOU0BCL-TXs  None
13    44         644        BSgg7mEzfqY  IoOXQUcW24s  646      Seq
13    45         645        VcmjbJ2q8-w  o4_2_a6LzFU  647      Seq  Dup
14    1          646        IoOXQUcW24s  o4_2_a6LzFU  647      NOT  Dup
14    2          647        o4_2_a6LzFU  ijVPcGaqVjc  648      Seq
14    3          648        ijVPcGaqVjc  nk3FlgEuG-s  649      Seq
14    4          649        nk3FlgEuG-s  27ZLFn8Dejg  None
The last three columns have the following meanings:
Same as: If an ID from Run 2 is the same as an ID from Run 1, then this column has the # overall for Run 1.
Seq: "Seq" if the number in the "Same as" column is one more than the previous number in that column; "NOT" if it is not.
Dup: Indicates whether an ID from Run 2 occurred more than once in that run.
Problems:
The videos XYIUM5TkhMo, LDjE28szwr8, qEh0uZuu7_U, Qsgz4GbL8O4, gOU0BCL-TXs, and 27ZLFn8Dejg were returned as #3, 6, 637, 640, 643, and 649 in Run 2, but were not returned at all in Run 1.
The videos FrG_eSECfps, r2-OGUu57pU, and lVEqwV0Nlzg were returned as #14, 55, and 642 in Run 1, but were not in Run 2.
The videos 0q0xS7W78KQ, H8AcqOh0wis, and o4_2_a6LzFU were returned as #49, 50, and 645 in Run 2, but then each appears a second time in that run (as well as appearing in Run 1 as #48, 50, and 647).
These results are troubling. They mean that no single search will return a reliable list of videos for a given value of q.
I mentioned at the beginning that previous questions about inconsistent results from the YouTube Data API had answers that seemed to resolve those inconsistencies. Is there a way to do that for this search? Is there something wrong with the way I'm composing the search that is causing the problem?
If there isn't a way to fix the search, then I suppose the only way to get a list of videos on the topic with high confidence of it being complete is to run the search multiple times and merge the results until no new IDs appear that were not in a previous result set. But even then, one would not know if there are other videos lurking undetected.
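A minimal sketch of that merge-until-stable approach (Python, illustrative; fetch_all_ids is assumed to be a helper that performs one complete paged search like the one above and returns the list of video IDs):
def collect_until_stable(fetch_all_ids, max_runs=10):
    # Repeat the full search until a run contributes no new video IDs.
    seen = set()
    for _ in range(max_runs):
        ids = set(fetch_all_ids())
        if ids <= seen:      # nothing new in this run; stop
            break
        seen |= ids          # merge this run into the accumulated set
    return seen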

Tableau running count reset

I have a list of sporting matches by time with result and margin. I want Tableau to keep a running count of number of matches since the last x (say, since the last draw - where margin = 0).
This will mean that on every record, the running count will increase by one unless that match is a draw, in which case it will drop back to zero.
I have not found a method of achieving this. The only way I can see to restart counts is via dates (e.g. a new year).
As an aside, I can easily achieve this by creating a running count tally OUTSIDE of Tableau.
The interesting thing is that Tableau then doesn't deal with this well when there is more than one result on the same day.
For example, if the structure is:
GameID Date Margin Running count
...
48 01-01-15 54 122
49 08-01-15 12 123
50 08-01-15 0 124
51 08-01-15 17 0
52 08-01-15 23 1
53 15-01-15 9 2
...
Then when trying to plot running count against date, Tableau rearranges the data to show:
GameID Date Margin Running count
...
48 01-01-15 54 122
51 08-01-15 17 0
52 08-01-15 23 1
49 08-01-15 12 123
50 08-01-15 0 124
53 15-01-15 9 2
...
I assume it is doing this because by default it sorts the running count data in ascending order when dates are identical.
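For reference, the external tally mentioned above is just a cumulative counter that resets on a draw. A rough sketch of that logic (Python, purely illustrative, not a Tableau calculated field):
def running_count_since_last_draw(margins):
    # Increase by one per match; drop back to zero on a draw (margin == 0).
    counts, current = [], 0
    for margin in margins:
        current = 0 if margin == 0 else current + 1
        counts.append(current)
    return counts

print(running_count_since_last_draw([54, 12, 0, 17, 23, 9]))
# -> [1, 2, 0, 1, 2, 3]; the sample table above instead resets on the row
#    after the draw, so shift the check to the previous margin if that is
#    the behaviour you want.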

Xcode: retrieving one line of a CSV based on a search query

Here is a sample of my CSV
10820 0 0 0 0
10900 2 4 4 4
11000 21 50 54 58
11100 23 54 59 63
11200 25 59 63 68
11300 27 63 68 73
11400 29 68 73 78
11500 31 72 78 83
11600 32 76 82 88
11700 34 81 87 93
I'm looking to use Xcode to retrieve one line from this very lengthy CSV based on the first column.
For example:
if the user enters "10900", the second line's columns will be returned.
If the user enters 11650, the 11600 line's columns will be returned, always taking the lower line when the input value is less than the next line's first value.
Any help would be appreciated. I've seen code to parse an entire CSV file, but I'm thinking this may be a big memory drain; right now my CSV has 2000 lines of values, all in ascending order by the first column.
You have to load the file into memory anyway to find the correct value.
With such a big CSV file I would recommend turning the CSV into a binary file (a plist, for example) and shipping it as binary inside your application, instead of parsing it at runtime each time. It performs much better and is easier to work with, since you deal directly with NSDictionary and NSArray objects.
If you don't want to do that for some reason, the next option is to use something like CHCSVParser:
https://github.com/davedelong/CHCSVParser
It can load only part of the file at a time, which is the optimization you might be looking for.
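Whichever storage route you choose, the lookup itself is a lower-bound search over the sorted first column. A minimal sketch of that logic (shown in Python for brevity rather than Objective-C; the file name and whitespace delimiter are assumptions based on the sample above):
import bisect

# Load the table once; each row is a list of integers, sorted by the first value.
with open("table.csv") as f:
    rows = [[int(x) for x in line.split()] for line in f if line.strip()]
keys = [row[0] for row in rows]

def lookup(value):
    # Return the row with the largest first-column value <= the input.
    i = bisect.bisect_right(keys, value) - 1
    return rows[i] if i >= 0 else None

print(lookup(10900))  # -> [10900, 2, 4, 4, 4]
print(lookup(11650))  # -> [11600, 32, 76, 82, 88]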

Parsing a Large Text in Sections in Matlab

I have a large text file as below imported in MATLAB:
Run Lat Long Time
1 32 32 34
1 23 22 21
2 23 12 11
2 11 11 11
2 33 11 12
up to 10 runs etc.
So I'm trying to break the file up into sections: section 1, section 2, etc., and write them to 10 different text files. File 1 will have the data from Run 1, File 2 the data from Run 2.
What you're looking for is Matlab's textread function. I'll give you the pieces you need and frame out the logic, but you'll need to connect the pieces yourself :)
Your read would look something like this
[head1, head2, head3, head4] = textread(file_name,'%s %s %s %s',1);
[run, lat, long, time] = textread(file_name,'%u %u %u %u','headerlines',1);
and your write method would use a loop to iterate over the values in
unique(run)
creating a file with
fout = fopen([base_file_name_out num2str(run_number)],'w');
and writing to it the values contained in
lat_this_run=Lat(run==run_number);
using the method
fprintf(fout,'%u %u %u\n', [lat_this_run, long_this_run, time_this_run]');
If your data is already loaded into matlab and named A, you could do:
>> a = max(A(:,1));
>> AA={};
>> for i = 1:a
AA{i}=A(find(A(:,1)==i),:)
name=sprintf('%d.txt',i);
dlmwrite(name,AA{i},'\t');
end
The output will be .txt files containing tab-delimited data.

How can I use gsub to replace "0" (only)

gsub('$0\n','') isn't working
I would prefer something similar. Note that the 0 in 10 and 20 must not be replaced; only lines consisting of just "0" should be removed.
If I have:
23
12
0
15
9
0
10
20
0
I want:
23
12
15
9
10
20
You may want to convert this to an array to re-process it, but the same thing can be done with a regular expression:
string.gsub(/^\s*0+$\n?/, '')
In Ruby, ^ and $ already anchor to the beginning and end of each line (the /m modifier only makes . match newlines), so no extra flag is needed here. Using \s* rather than \s+ lets the pattern match a line that is nothing but zeros, and the trailing \n? removes the line's newline so no blank line is left behind.
