Create a list of all matched strings without duplicates - grep

I have a list of urls that look like:
http://example.com/page1
http://example.com/page1
http://example.com/page1
http://example.com/page2
http://example.com/page2
http://example.com/page3
From this, I want to create a list that's like:
http://example.com/page1
http://example.com/page2
http://example.com/page3
So if there is more than one match, I want to return only one of the matches. What would the grep pattern be for that? Thanks.

You can do this very easily with awk:
$ awk '!url[$0]++' input
http://example.com/page1
http://example.com/page2
http://example.com/page3
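
For anyone wondering why that one-liner works, here is a minimal sketch. The array name seen, the temporary file, and the sample order of the URLs are my own choices for illustration, not part of the answer above. The sort -u line is an alternative I am adding for comparison; it gives the same set of lines but does not keep the original order.

```shell
# Sample input written to a temporary file (the file is only illustrative).
tmp=$(mktemp)
printf '%s\n' \
  http://example.com/page1 \
  http://example.com/page1 \
  http://example.com/page2 \
  http://example.com/page3 \
  http://example.com/page2 > "$tmp"

# seen[$0] is 0 (false) the first time a line appears, so !seen[$0] is true
# and awk prints the line; the ++ then marks it as seen for later occurrences.
deduped=$(awk '!seen[$0]++' "$tmp")
printf '%s\n' "$deduped"

# sort -u also removes duplicates, but it reorders the lines.
sorted=$(sort -u "$tmp")

rm -f "$tmp"
```

The awk version keeps the first occurrence of each line in input order, which matters if the list is not already sorted.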

Related

Merge Multiple Pipe Delimited Files Using Awk?

I would like to straight up merge multiple pipe-delimited files using awk. Every example I have found on here is several times more complicated than what I am trying to do. I have several text files formatted identically and just want to merge them together, like a UNION ALL in SQL. I don't need to join on a column, and I don't care about duplicate rows.
Concatenating the files should work for you then:
cat file1.txt file2.txt file3.txt > finalFile.txt
No need for awk.
That is a job for cat (see @mjuarez's answer), but if you really want to use awk for it:
$ awk 1 files* > another_file
(g)awk '{print}' file1 file2 file* >> outputfile
This assumes you want to use awk only.
James' answer is much better, but I still wanted to show what I came up with: a very basic use of awk. :)
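
One practical difference between the cat and awk approaches, shown as a rough sketch: if a file does not end with a newline, cat glues its last row to the first row of the next file, while awk 1 prints every record with a newline after it. The file names part1.txt and part2.txt and their contents are made up for this example.

```shell
# Work in a scratch directory; part1.txt and part2.txt are made-up names.
dir=$(mktemp -d)
printf 'a|1\nb|2' > "$dir/part1.txt"    # note: no newline after the last row
printf 'c|3\n'    > "$dir/part2.txt"

# cat copies bytes verbatim, so "b|2" and "c|3" end up on one line.
cat_lines=$(cat "$dir/part1.txt" "$dir/part2.txt" | wc -l | tr -d ' ')

# awk 1 prints every record followed by a newline, which repairs the seam.
awk_lines=$(awk 1 "$dir/part1.txt" "$dir/part2.txt" | wc -l | tr -d ' ')

echo "cat: $cat_lines lines, awk: $awk_lines lines"
rm -rf "$dir"
```

If all your files end with a newline, the two commands produce the same output and cat is the simpler choice.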

How to diff records within same file

I have following file format:
AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC
AAA-12345~RCV~~BBBBBBBBBBBB222222ETC
BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC
BBB-78900~RCV~~DDDDDDDDDDDD555555ETC
CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC
All lines come in pairs, and the two lines of each pair are identical up to the first ~.
Sometimes there are orphans, like the last record, which has TRAX but no RCV.
The question is: using bash utilities like sed or awk, or commands like grep or cut, how do I find and display only the orphans?
Using awk:
awk -F~ '{a[$1]+=1} END{for(key in a) if(a[key]==1){print key}}' file
This loads the first field (split on the tilde) as the key of an array and increments the value for that key each time it is found. When the file is finished, it iterates over the array and prints the keys whose value is exactly 1. Here, file stands for the name of your input file.
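
The command above prints only the key of each orphan. If you want the whole orphan record instead, a possible variation is to remember the line alongside the count. This is a sketch under my own assumptions: the array names count and line are hypothetical, and the sample data is copied from the question into a temporary file.

```shell
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
AAA-12345~TRAX~~AAAAAAAAAAAA111111ETC
AAA-12345~RCV~~BBBBBBBBBBBB222222ETC
BBB-78900~TRAX~~CCCCCCCCCCCC444444ETC
BBB-78900~RCV~~DDDDDDDDDDDD555555ETC
CCC-65432~TRAX~~HHHHHHHHHHHH888888ETC
EOF

# count[$1] tallies each key; line[$1] remembers the most recent record for it.
# In END, keys seen exactly once are orphans, so their stored record is printed.
orphans=$(awk -F'~' '{ count[$1]++; line[$1] = $0 }
                     END { for (k in count) if (count[k] == 1) print line[k] }' "$tmp")
printf '%s\n' "$orphans"
rm -f "$tmp"
```

Note that awk does not guarantee any particular order for "for (k in count)", so with several orphans the output order may differ from the input order.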

Result from 'findstr /G:' not complete, comparing to 'grep -f'

I have to look up a list of thousands of gene names (genelist.txt, one column) in a database file (database.txt, multiple columns). Any line containing at least one gene name from genelist.txt should be extracted to output.txt.
I used to do it like this:
findstr /G:genelist.txt database.txt >output.txt
It works well and is fast. However, I found out today that the final output is affected by the order of the genes in genelist.txt. An unsorted gene list gives one result, while sorting the gene list and searching again gives another result with more lines. But even with the sorted gene list, output.txt still does not contain all matching lines; some records are missing. I noticed this after comparing with the result from
grep -f "genelist.txt" database.txt > output.txt
The results from grep are the same whether or not the gene list is sorted, but it is a bit slower than findstr.
I was wondering why this happens. Can I add any arguments to findstr to make it return a complete list of results?
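
This does not explain the findstr behaviour, but on the grep side there are two standard options that may be worth trying, sketched here under assumptions: -F treats each line of the pattern file as a fixed string rather than a regular expression, which is often faster for long lists, and -w restricts matches to whole words. The gene names and rows below are invented sample data, not taken from the question.

```shell
dir=$(mktemp -d)
# Invented sample data: two gene names and four database rows.
printf '%s\n' BRCA1 TP53 > "$dir/genelist.txt"
printf '%s\n' 'BRCA1 rowA' 'TP53 rowB' 'EGFR rowC' 'TP53BP1 rowD' > "$dir/database.txt"

# -F: treat every line of genelist.txt as a literal string, not a regex.
matches=$(grep -F -f "$dir/genelist.txt" "$dir/database.txt")

# -w: additionally require whole-word matches, so TP53 does not hit TP53BP1.
whole=$(grep -F -w -f "$dir/genelist.txt" "$dir/database.txt")

printf 'substring matches:\n%s\n' "$matches"
printf 'whole-word matches:\n%s\n' "$whole"
rm -rf "$dir"
```

Whether -w is appropriate depends on how gene names are delimited in your database file, so check a few known cases before relying on it.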

How to search a pattern like pattern using grep only?

I am searching for a list of names matching the pattern "japconfig". There are many files inside one directory. Those files contain names like ixdf_japconfig_FZ.txt, ixdf_japconfig_AB.txt, ixdf_japconfig_RK.txt, ixdf_japconfig_DK.txt, ixdf_japconfig_LY.txt. But I don't know which names appear after the word japconfig. I need to list all such names. My files also contain ixdf_dbconfig.txt, but I don't want to print ixdf_dbconfig.txt in the output.
Each of my files contains one ixdf_japconfig_*.txt and ixdf_dbconfig.txt, where * can be FZ, AB, RK, DK, or LY. I can get the desired result by using grep and then awk to cut the columns, but I don't want to use awk or any other command. I want to achieve this using grep only.
I need to print below names.
ixdf_japconfig_FZ.txt
ixdf_japconfig_AB.txt
ixdf_japconfig_RK.txt
ixdf_japconfig_DK.txt
ixdf_japconfig_LY.txt
I don't want to print ixdf_dbconfig.txt.
When I tried the command grep -oh "ixdf_japconfig.*.txt" *.dat, I got the output below.
ixdf_japconfig_FZ.txt ixdf_dbconfig.txt
ixdf_japconfig_AB.txt ixdf_dbconfig.txt
ixdf_japconfig_RK.txt ixdf_dbconfig.txt
ixdf_japconfig_DK.txt ixdf_dbconfig.txt
ixdf_japconfig_LY.txt ixdf_dbconfig.txt
Here the first column is the one I want, but I don't want to print the second column. How can I change my command to print only the first column?
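grep -oh 'ixdf_japconfig_..\.txt' *.dat
(Your .*. was greedy, so it matched everything up to the last txt on the line. Two dots match exactly the two-letter suffix, and \. matches a literal dot.)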
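
To see the greedy match for yourself, here is a small sketch. It assumes the two names are separated by a space on the same line, as the output in the question suggests; the [^ ]* variant is an alternative of my own for the case where the suffix is not always two characters.

```shell
# One line shaped like the output shown in the question (assumed layout).
line='ixdf_japconfig_FZ.txt ixdf_dbconfig.txt'

# Greedy: .* runs as far as it can, so the match ends at the last "txt".
greedy=$(printf '%s\n' "$line" | grep -o 'ixdf_japconfig.*.txt')

# Restricted: [^ ]* cannot cross a space, so the match stops at the first name.
narrow=$(printf '%s\n' "$line" | grep -o 'ixdf_japconfig_[^ ]*\.txt')

printf 'greedy: %s\n' "$greedy"
printf 'narrow: %s\n' "$narrow"
```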

GREP create a list of words that contain a string

I have a folder with a lot of text files and would like to get a list of all words in that folder that contain a certain string. For example, there are words of the form 'IND:abc', 'IND:cde', ... and I am looking for a way to get a list of all words starting with IND:, so something like:
[IND:abc, IND:cde, IND:...]
Can grep do that?
Cheers,
Chris
grep -ho 'IND:\w\+' * | sort | uniq
-h suppresses the filenames so that you only get the text. -o prints only the matching part of the text. If you want to see the duplicates, just remove the sort and uniq.
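
As far as I know, \w and \+ are GNU extensions and may not be available in every grep. A variant that should be more portable, sketched under assumptions, uses -E with a POSIX character class and folds sort | uniq into sort -u. The directory, file names, and sample words below are made up for the example.

```shell
dir=$(mktemp -d)
# Two made-up text files containing IND: words, with one duplicate.
printf 'foo IND:abc bar IND:cde\n' > "$dir/a.txt"
printf 'IND:abc again, plus IND:xyz\n' > "$dir/b.txt"

# -E with [[:alnum:]_]+ stands in for \w\+; sort -u sorts and drops duplicates.
words=$(grep -hoE 'IND:[[:alnum:]_]+' "$dir"/* | sort -u)
printf '%s\n' "$words"
rm -rf "$dir"
```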
