Result from 'findstr /G:' not complete compared to 'grep -f'

I have to look up a list of thousands of gene names (genelist.txt, one column) in a database file (database.txt, multiple columns). Any line containing at least one gene name from genelist.txt should be extracted to output.txt.
I used to do it like this:
findstr /G:genelist.txt database.txt >output.txt
It works well and is fast. However, I just found out today that the final output is affected by the gene order in the original genelist.txt: there is one result when using an unsorted gene list, and another result with more lines when sorting the gene list and searching again. But even with the sorted gene list, output.txt still does not contain all matching lines; I'm missing some records. I only noticed this after a comparison with the result from
grep -f "genelist.txt" database.txt > output.txt
The results from grep are identical whether or not the gene list is sorted, but grep is a bit slower than findstr.
I was wondering why this happens. Can I add any arguments to findstr to make it return a complete result list?
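Since the gene names are fixed strings rather than regular expressions, grep's -F (fixed-strings) mode may also be worth trying; it typically speeds up matching against large pattern lists:
grep -F -f genelist.txt database.txt > output.txt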

Combine log lines with awk

I have a log file that, simplified, looks like this (it has enough columns that addressing the columns directly is not feasible):
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
Most of the fields appear only once per id, but not all of them (see time, for example).
What I want to achieve is to have one line per id that combines the columns of all records with the same id:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:03.006Z,bar_host,192.168.0.13,bar_user
For the columns that appear more than once per id, I don't care which value is returned as long as it relates to a record with the same id.
I would exploit GNU AWK 2D arrays in the following way. Let file.txt content be
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,,,
foo1,2022-05-10T00:01.002Z,foo_host,,
foo1,2022-05-10T00:01.003Z,,192.168.0.1,
foo1,2022-05-10T00:01.004Z,,,foo_user
bar1,2022-05-10T00:02.005Z,,,
bar1,2022-05-10T00:03.006Z,bar_host,,
bar1,2022-05-10T00:04.007Z,,192.168.0.13,
bar1,2022-05-10T00:05.008Z,,,bar_user
then
awk 'BEGIN{FS=OFS=",";cols=5}NR==1{print}NR>1{for(i=1;i<=cols;i+=1){arr[$1][i]=arr[$1][i]?arr[$1][i]:$i}}END{for(i in arr){for(j in arr[i]){$j=arr[i][j]};print}}' file.txt
output
id,time,host,ip,user_uuid
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
Explanation: First I inform GNU AWK that both the field separator (FS) and the output field separator (OFS) are ,, and I use the cols variable to hold the number of columns you wish to have. The first row I simply print. For the following rows, for each column I check whether there is already some truthy value in arr[id][field number] using the so-called ternary operator: if yes, I keep it; otherwise I set the value to the current field. In END I use nested for loops: for each id I set the value of each of its fields in the current line, so GNU AWK builds a string from these, which I can print. Disclaimer: this solution assumes the number of columns is equal in all lines, the number of columns is known a priori, and any order of output is acceptable. If this does not hold, develop your own superior solution.
(tested in gawk 4.2.1)
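For readability, the same program can be expanded into a commented script file (identical logic; it still requires gawk because of the true 2D arrays). The file name combine.awk is arbitrary; run it as gawk -f combine.awk file.txt:
# combine.awk -- expanded version of the one-liner above
BEGIN { FS = OFS = ","; cols = 5 }   # comma-separated in and out; 5 known columns
NR == 1 { print }                    # pass the header through unchanged
NR > 1 {
    for (i = 1; i <= cols; i += 1) {
        # keep the first truthy value seen for this id and column
        arr[$1][i] = arr[$1][i] ? arr[$1][i] : $i
    }
}
END {
    for (id in arr) {
        for (j in arr[id]) {
            $j = arr[id][j]          # rebuild the output record field by field
        }
        print
    }
}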
You can use the Ruby CSV parser to group and then reduce the repeated entries:
ruby -r csv -e '
  data = CSV.parse($<.read, col_sep: ",")
  puts data[0].to_csv
  data[1..].group_by { |row| row[0] }.
    each { |_k, arr|
      puts arr.transpose.map { |ta| ta.find { |x| !x.nil? } }.to_csv
    }
' file
Prints:
id,time,host,ip,user_uuid
foo1,2022-05-10T00:01.001Z,foo_host,192.168.0.1,foo_user
bar1,2022-05-10T00:02.005Z,bar_host,192.168.0.13,bar_user
This assumes the valid data is the first non-nil, nonblank encountered for that particular column.
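Note that find { |x| !x.nil? } skips only nil cells, which is what Ruby's CSV parser produces for empty unquoted fields; if blank strings can also occur in the data, the predicate could be tightened, e.g. by replacing the inner line with:
puts arr.transpose.map { |ta| ta.find { |x| !x.nil? && !x.strip.empty? } }.to_csv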

How to limit Jenkins API response to last n build IDs

http://xxx/api/xml?&tree=builds[number,description,result,id,actions[parameters[name,value]]]
The above API returns all the build IDs. Is there a way to limit the results to get the last 5 build IDs?
The tree query parameter allows you to explicitly specify and retrieve only the information you are looking for, by using an XPath-ish path expression. The value should be a list of property names to include, with sub-properties inside square braces. Try tree=jobs[name],views[name,jobs[name]] to see just a list of jobs (giving only the name) and views (giving the name and the jobs they contain). Note: for array-type properties (such as jobs in this example), the name must be given in the original plural, not in the singular as the element would appear in XML. This will be more natural for e.g. json?tree=jobs[name] anyway: the JSON writer does not do plural-to-singular mangling because arrays are represented explicitly.
For array-type properties, a range specifier is supported. For example, tree=jobs[name]{0,10} would retrieve the name of the first 10 jobs. The range specifier has the following variants (a concrete example follows the list):
{M,N}: From the M-th element (inclusive) to the N-th element (exclusive).
{M,}: From the M-th element (inclusive) to the end.
{,N}: From the first element (inclusive) to the N-th element (exclusive). The same as {0,N}.
{N}: Just retrieve the N-th element. The same as {N,N+1}.
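For instance, since Jenkins lists builds newest first, the {N} form fetches just the most recent build; applied to the OP's placeholder URL:
http://xxx/api/xml?tree=builds[number,result]{0}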
Another way to retrieve more data is to use the depth=N query parameter. This retrieves all the data up to the specified depth. Compare depth=0 and depth=1 and see what the difference is for yourself. Also note that data created by a smaller depth value is always a subset of the data created by a bigger depth value.
Because of the size of the data, the depth parameter should really be only used to explore what data Jenkins can return. Once you identify the data you want to retrieve, you can then come up with the tree parameter to exactly specify the data you need.
I'm on version 1.509.4, which doesn't support the range specifier.
Source: http://ci.citizensnpcs.co/api/
You can create an xml object with the build numbers via xpath and parse it yourself via different means.
http://xxx/api/xml?xpath=//build/number&wrapper=meep
Creates an xml that looks like:
<meep>
<number>n</number>
<number>n+1</number>
...
<number>m</number>
</meep>
It will be populated with the build numbers n through m that are currently in jenkins for the job specified in the url. You can substitute anything for the word "meep"; that will become the wrapper element for the newly created xml object.
How are you collecting/manipulating the api xml output once you get it? There is a solution here for How do I select the last N elements with XPath?. I tried some of these xpath manipulations, but I couldn't get them to work when playing with the url in my browser; they might work if you are doing something else.
When I get the xml object, I happen to manipulate it via shell scripts.
#!/bin/sh
# NOTE: To get the url to work with curl, you need a valid jenkins user and api token
# Put all build numbers in a variable called build_ids
build_ids="$(curl -sL --user ${_jenkins_api_user}:${_jenkins_api_token} \
"${_jenkins_url}/job/${_job_name}/api/xml?xpath=//build/number&wrapper=meep" \
| sed -e 's/<[^>]*>/ /g' | sed -e 's/  */ /g')"
# Print the last 5 items with awk
echo "${build_ids}" | awk '{n = 5; for (--n; n >= 0; n--){ printf "%s\t",$(NF-n)} print ""}';
Once you have your xml object you can essentially parse it however you want.
NOTE: I am running Jenkins ver. 2.46.1
Looking at the docs at the raw .../api/ endpoint (on Jenkins 2.60.3), it says
For array-type properties, a range specifier is supported. For
example, tree=jobs[name]{0,10} would retrieve the name of the first 10
jobs. The range specifier has the following variants:
{M,N}: From the M-th element (inclusive) to the N-th element (exclusive).
{M,}: From the M-th element (inclusive) to the end.
{,N}: From the first element (inclusive) to the N-th element (exclusive). The same as {0,N}.
{N}: Just retrieve the N-th element. The same as {N,N+1}.
For the OP's case, you'd append {,5} to the end of the URL to get the first 5 results:
http://xxx/api/xml?&tree=builds[number,description,result,id,actions[parameters[name,value]]]{,5}

How to diff records within same file

I have the following file format:
AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC
AAA-12345~RCV~~BBBBBBBBBBBB222222ETC
BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC
BBB-78900~RCV~~DDDDDDDDDDDD555555ETC
CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC
All lines come in pairs, and each pair is identical up to the first ~.
Sometimes there are orphans, like the last record, which has TRAX but no RCV.
The question is: using bash utilities like sed or awk, or commands like grep or cut, how do I find and display only the orphans?
Using awk:
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}' file
This just loads the first field (split on tilde) as the key of an array and increments the value for that key each time it's found. Then, when the file is finished, it iterates over the array and prints the keys whose value is just 1.
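Run against the sample input above (saved as file), only the unpaired key is printed:
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}' file
CCC-65432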

How to search for a pattern using grep only?

I am searching for a list of names matching the pattern "japconfig". There are many files inside one directory. Those files contain names like ixdf_japconfig_FZ.txt,
ixdf_japconfig_AB.txt, ixdf_japconfig_RK.txt, ixdf_japconfig_DK.txt, ixdf_japconfig_LY.txt. But I don't know which names come after the word japconfig. I need to list all such names. My files also contain ixdf_dbconfig.txt, but I don't want to print ixdf_dbconfig.txt in the output.
Each of my files contains one ixdf_japconfig_*.txt and ixdf_dbconfig.txt, where * can be FZ, AB, RK, DK, or LY. I can achieve my desired result by using grep and then awk to cut the columns, but I don't want to use awk or another command. I want to achieve this using grep only.
I need to print the names below.
ixdf_japconfig_FZ.txt
ixdf_japconfig_AB.txt
ixdf_japconfig_RK.txt
ixdf_japconfig_DK.txt
ixdf_japconfig_LY.txt
I don't want to print ixdf_dbconfig.txt.
When I tried the command grep -oh "ixdf_japconfig.*.txt" *.dat, I got the output below:
ixdf_japconfig_FZ.txt ixdf_dbconfig.txt
ixdf_japconfig_AB.txt ixdf_dbconfig.txt
ixdf_japconfig_RK.txt ixdf_dbconfig.txt
ixdf_japconfig_DK.txt ixdf_dbconfig.txt
ixdf_japconfig_LY.txt ixdf_dbconfig.txt
where the first column is my desired column, but I don't want to print the second column. How can I change my code to print only the first column?
grep -oh ixdf_japconfig_...txt *.dat
(Your .*. was matching most of the line.)
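If the suffix is always exactly two uppercase letters, as in the question's examples, a tighter pattern avoids accidental matches:
grep -oh 'ixdf_japconfig_[A-Z]\{2\}\.txt' *.dat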

GREP: create a list of words that contain a string

I have a folder with a lot of text files and would like to get a list of all words in that folder that contain a certain string. So, e.g., there are words of the form 'IND:abc', 'IND:cde', ..., and I am looking for a way to get a list of all words starting with IND:, so something like:
[IND:abc, IND:cde, IND:...]
Can grep do that?
Cheers,
Chris
grep -ho 'IND:\w\+' * | sort | uniq
-h suppresses the filenames so that you will only get the text. -o prints only the matching part of the text. If you want to see the duplicates, just remove the sort and uniq.
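If the text files also sit in subfolders, GNU grep can recurse, and sort -u does the work of sort | uniq in one step:
grep -rho 'IND:\w\+' . | sort -u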
