Find specific words after a match

I have a dataset that looks something like this:
chr1 StringTie exon 197757319 197757401 1000 + . gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "1"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr1 StringTie exon 197761802 197761965 1000 + . gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "2"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr9 StringTie exon 63396911 63397070 1000 - . gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "1";
chr9 StringTie exon 63397111 63397185 1000 - . gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "2";
chr21 StringTie exon 44884690 44884759 1000 + . gene_id "MSTRG.87407"; transcript_id "MSTRG.87407.1"; exon_number "1";
chr22 HAVANA exon 19667023 19667199 . + . gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "1"; gene_name "AC000067.1";
chr22 HAVANA exon 19667446 19667555 . + . gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "2"; gene_name "AC000067.1";
I want to isolate the gene_ids. Therefore, the desired output is:
MSTRG.10429
MSTRG.10429
MSTRG.145111
MSTRG.145111
MSTRG.87407
ENSG00000225007.1
ENSG00000225007.1
I've tried the following:
grep -E -o "gene_id.{0,20}" gtf_om_ENSGids_te_vinden.gtf > alle_gene_ids.txt
With this I can grab the 20 characters after "gene_id", and I planned to later remove the characters that do not belong to the answer, such as parts of the word "transcript". However, the problem is that the ref_gene_ids also get copied, and they do not belong in the desired output. I tried to solve this by adding the -w flag, but that is also wrong for some reason. Can anyone help?
Thanks!

GNU grep, using the perl regex flag:
grep -Po '(?<=\Wgene_id ")[^"]+'
POSIX sed:
sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p'
If there are multiple occurrences per line, the grep will print all of them, but the sed will print the last occurrence only.
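As a quick check, the sed variant above can be wrapped in a small function and run on one sample line. This is only a sketch: the function name gene_ids is made up for illustration, and the sample line assumes tab-separated GTF columns.

```shell
# gene_ids: print the gene_id value of each GTF line read from stdin or from files.
# The [^[:alnum:]_] in front of gene_id keeps ref_gene_id from matching.
gene_ids() {
  sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p' "$@"
}

# One sample line in the question's format (columns separated by tabs, an assumption).
printf 'chr1\tStringTie\texon\t197757319\t197757401\t1000\t+\t.\tgene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; ref_gene_id "ENSG00000224901.1";\n' | gene_ids
# prints MSTRG.10429
```

To write the IDs to a file as in the question, redirect the output, for example gene_ids gtf_om_ENSGids_te_vinden.gtf > alle_gene_ids.txt.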

Use:
grep -o -E ' gene_id \"([^"]*)\"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"| //g'
The space in ' gene_id is needed to make sure that ref_gene_id is not matched.
The sed part removes gene_id, the space, and the double quotes.
See: https://regex101.com/r/TDA7Cg/1
EDIT: In the real file the character before gene_id is a tab, not a space, so change it to
grep -o -E '[[:blank:]]gene_id "([^"]*)"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:blank:]]//g'
([[:blank:]] matches a space or a tab; inside a bracket expression grep does not treat \t as a tab.)
or, to just find the start of the word, you could do
grep -o -E '\Wgene_id "([^"]*)"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:blank:]]//g'
But still, the accepted answer is a nicer way to do it ... 😉

Related

Search for part of string with grep in all files in folder and subfolders

I have .html files in directories and subdirectories. I need to extract all strings that follow "example.com/". The surrounding text can look like this:
["https://example.com/folder1",
href="https://example.com/anotherfolder2" target="
etc.
What I want to extract is:
folder1
anotherfolder2
etc.
from all files in all folders into one list, with each word on a new line.
I found some highly upvoted examples on StackOverflow, but they did not work. I tried this (adapted from one of the examples):
grep -Po '(?<=example.com=)[^,]*'
Thank you for help!

grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2| sed -e 's/https:\/\/example.com\///g'
grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2 extracts the content of the quoted string.
sed -e 's/https:\/\/example.com\///g' strips the https://example.com/ prefix, leaving only the path.
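The same idea can be sketched with a single recursive grep that prints only the matched URL, followed by a sed that drops the prefix. The function name list_paths and the demo file names are made up for illustration, and the -r, -h and -o options are assumed to be available (they are in GNU and BSD grep).

```shell
# list_paths DIR: print the first path segment after https://example.com/
# for every match in every file under DIR, one per line.
list_paths() {
  grep -rhoE 'https://example\.com/[^"/[:space:]]+' "$1" |
    sed 's|^https://example\.com/||'
}

# Demo on throwaway files (hypothetical names and contents).
demo=$(mktemp -d)
mkdir -p "$demo/sub"
printf '["https://example.com/folder1",\n' > "$demo/a.html"
printf 'href="https://example.com/anotherfolder2" target="\n' > "$demo/sub/b.html"
list_paths "$demo" | sort
rm -rf "$demo"
```

Here -h suppresses the file name prefix that grep -r would otherwise add, so no cut step is needed.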

echo "https://example.com/folder1" | tr -s '/' | tr '/' '\n' > file
sed -i '1d' file # Drop the "https:" line
sed -n '1p' file # This will give you example.com
sed -n '2p' file # This will give you folder1
sed -i '1s#example\.com#newsite.com#' file # Rewrite the host
printf 'http://' > nf # Scheme, without a trailing newline
paste -sd/ file >> nf # Join the remaining lines with "/"
cat nf # This should be http://newsite.com/folder1
rm -v ./nf

Why do these patterns return the same result?

I saw this question: count (non-blank) lines-of-code in bash
I understand that this pattern is correct:
grep -vc ^$ filename
Why does this pattern return the same result?
grep -c '[^ ]' filename
What is the trick in '[^ ]'?

$ printf 'foo 123\n \nxyz\n\t\n' > ip.txt
$ cat -T ip.txt
foo 123
xyz
^I
$ grep -vc '^$' ip.txt
4
$ grep -c '[^ ]' ip.txt
3
$ grep -c '[^[:blank:]]' ip.txt
2
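grep -vc '^$' counts every line that is not completely empty, whereas grep -c '[^ ]' counts every line that contains at least one character other than a space. For example, foo 123 is counted because letters and digits are not space characters, and the tab-only line is counted too, while the line holding a single space is not. The two commands give the same result only when the file has no lines consisting solely of spaces, so which one to use depends on whether such lines should be counted.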

Jenkins pipeline failing for sed command

I'm running a sed command in pipeline but it is failing with an error
sed: file - line 2: unterminated `s' command
here is my code for sed in pipeline:
sh '''sed '/^[[:blank:]]*$/d;s|^|s/%%|;s|:|%%/|;s|$|/|' key.txt | sed -f - file1.txt > file2.txt'''
If I run only the first sed command, Jenkins gives very strange output. Here it is from the Jenkins logs; it adds extra lines with stray characters:
+ sed '/^[[:blank:]]*$/d;s|^|s#%%|;s|:|%%#|;s|$|#|' keys.txt
s#%%sql_server_name%%#test_seqserver_1234
#
s#%%
#
s#%%sql_login_name%%#test_login_name
#
s#%%
#
s#%%password%%#test_password
#
s#%%
#
If I run the following command, this is what the Jenkins output looks like:
sh '''sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' keys.txt'''
Jenkins output:
+ sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' keys.txt
s/%%sql_server_name%%/test_seqserver_1234
/
s/%%sql_login_name%%/test_login_name
/
s/%%password%%/test_password
/
s/%%SID%%/123456
/
I ran the new command:
sh '''sed -e '/:/!d;s|^\\([^:]*\\):\\(.*\\)$|s/%%\\1%%/\\2/|' -e 'N;s|\\n/|/|' keys.txt'''
Here is the output:
Running shell script
+ sed -e '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' -e 'N;s|\n/|/|' keys.txt
s/%%sql_server_name%%/test_seqserver_1234
/
s/%%sql_login_name%%/test_login_name
/
s/%%password%%/test_password
/
s/%%SID%%/123456
Here is xxd output for text file:
Running shell script
+ xxd keys.txt
0000000: 7371 6c5f 7365 7276 6572 5f6e 616d 653a sql_server_name:
0000010: 7465 7374 5f73 6571 7365 7276 6572 5f31 test_seqserver_1
0000020: 3233 340d 0a0d 0a73 716c 5f6c 6f67 696e 234....sql_login
0000030: 5f6e 616d 653a 7465 7374 5f6c 6f67 696e _name:test_login
0000040: 5f6e 616d 650d 0a0d 0a70 6173 7377 6f72 _name....passwor
0000050: 643a 7465 7374 5f70 6173 7377 6f72 6420 d:test_password
0000060: 0d0a 0d0a 5349 443a 3132 3334 3536 200d ....SID:123456 .
0000070: 0a0d 0a64 6566 6175 6c74 5f64 6174 6162 ...default_datab
0000080: 6173 653a 7465 6d70 6462 0d0a 0d0a 6465 ase:tempdb....de
0000090: 6661 756c 745f 6c61 6e67 7561 6765 3a75 fault_language:u
00000a0: 735f 656e 676c 6973 680d 0a0d 0a63 6865 s_english....che
00000b0: 636b 5f65 7870 6972 6174 696f 6e3a 4f46 ck_expiration:OF
00000c0: 460d 0a0d 0a63 6865 636b 5f70 6f6c 6963 F....check_polic
00000d0: 793a 4f46 460d 0a0d 0a64 656c 6976 6572 y:OFF....deliver
00000e0: 7974 7970 653a 7363 6865 6475 6c65 640d ytype:scheduled.
00000f0: 0a0d 0a73 6368 6564 756c 6564 5f64 656c ...scheduled_del
0000100: 6976 6572 7964 6174 653a 3035 2d33 302d iverydate:05-30-
0000110: 3230 3939 0d0a 0d0a 7363 6865 6475 6c65 2099....schedule
0000120: 645f 6465 6c69 7665 7279 5f32 3468 725f d_delivery_24hr_
0000130: 6365 6e74 7261 6c5f 7469 6d65 3a31 3135 central_time:115
0000140: 3920 0d0a 0d0a 0d0a 0d0a 0d0a 0d0a 0d0a 9 ..............

It looks like your key.txt file contains an invalid line. Judging from the first sed command:
sed '/^[[:blank:]]*$/d; s|^|s/%%|; s|:|%%/|; s|$|/|' key.txt
it expects each line to contain a colon. It then generates the sed script for the second sed command:
sed -f - file1.txt > file2.txt
If your key.txt contains non-empty lines without a colon, you will get the error unterminated 's' command.
Make sure that key.txt is correct, or at least add /:/!d; to your pipeline, like this:
sh '''sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' key.txt | sed -f - file1.txt > file2.txt'''
For example, correct key.txt contents:
username:server1
Incorrect key.txt:
username server2
There is no colon in this line, so it will cause the error.
You might try to replace your first sed command with a simpler one:
sed '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' key.txt
or better:
sed -e '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' -e 'N;s|\n/|/|' key.txt
If that doesn't help, run xxd key.txt or hexdump -C key.txt and post the output.
After you added the hex dump of your key.txt file, I could finally reproduce the issue on my machine: the lines end in CRLF (0d 0a). The problem can be solved with this command:
sed -e '/:/!d;s|^\([^:]*\):\(.*\)\r|s/%%\1%%/\2/|' key.txt
So the trick is to use \r instead of $ in the first sed command. If it still doesn't work for you (which may be the case on macOS), you can simply remove the carriage returns from key.txt with a tool of your choice (such as dos2unix), and then your original code should work.
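A variant that does not depend on sed understanding \r is to delete the carriage returns with tr before generating the script. The following is a minimal sketch, not the pipeline from the question: the function name build_subst_script, the temporary file names, and the sample template are assumptions made for illustration.

```shell
# build_subst_script: read "key:value" lines on stdin (CRLF line endings tolerated)
# and print one sed command of the form s/%%key%%/value/ per line that has a colon.
build_subst_script() {
  tr -d '\r' | sed -e '/:/!d' -e 's|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|'
}

# Demo with throwaway files standing in for key.txt and file1.txt.
work=$(mktemp -d)
printf 'sql_server_name:test_seqserver_1234\r\n\r\nSID:123456\r\n' > "$work/key.txt"
printf 'server=%%%%sql_server_name%%%%\nsid=%%%%SID%%%%\n' > "$work/file1.txt"

build_subst_script < "$work/key.txt" > "$work/subst.sed"
sed -f "$work/subst.sed" "$work/file1.txt"
# prints:
#   server=test_seqserver_1234
#   sid=123456
rm -rf "$work"
```

Two caveats apply to this sketch. Values containing / or & would need escaping before being used as a sed replacement, and trailing spaces in a value (the hex dump shows one after test_password) are kept as they are. Inside a Groovy ''' string the backslashes have to be doubled, as in the command shown in the question.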

docker-compose wurstmeister/kafka failing to parse KAFKA_OPTS

I have a basic docker-compose file for wurstmeister/kafka.
I'm trying to configure it to use SASL_PLAIN with SSL.
However, I keep getting this error no matter how I specify my JAAS file.
This is the error I get:
[2018-04-11 10:34:34,545] FATAL [KafkaServer id=1001] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.IllegalArgumentException: Could not find a 'KafkaServer' or 'sasl_ssl.KafkaServer' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
These are the vars I have. Last one is where I specify my jaas file
environment:
KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
KAFKA_HOST_NAME: 10.10.10.1
KAFKA_PORT: 9092
KAFKA_ADVERTISED_PORT: 9093
KAFKA_ADVERTISED_HOST_NAME: 10.10.10.1
KAFKA_LISTENERS: PLAINTEXT://:9092,SASL_SSL://:9093
KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://10.10.10.1:9092,SASL_SSL://10.10.10.1:9093
KAFKA_SECURITY_INTER_BROKER_PROTOCOL: SASL_SSL
KAFKA_SASL_ENABLED_MECHANISMS: PLAIN
SASL_MECHANISM_INTER_BROKER_PROTOCOL: PLAIN
KAFKA_SASL_MECHANISM_INTER_BROKER_PROTOCOL: PLAIN
KAFKA_SSL_TRUSTSTORE_LOCATION: /kafka.server.truststore.jks
KAFKA_SSL_TRUSTSTORE_PASSWORD: password
KAFKA_SSL_KEYSTORE_LOCATION: /kafka.server.keystore.jks
KAFKA_SSL_KEYSTORE_PASSWORD: password
KAFKA_SSL_KEY_PASSWORD: password
KAFKA_OPTS: '-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'
Also when I try to check the docker logs I see
/usr/bin/start-kafka.sh: line 96: KAFKA_OPTS=-Djava.security.auth.login.config: bad substitution
Any help is greatly appreciated!

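The equals sign '=' inside the last value is causing this issue. The start script splits each NAME=value pair with a greedy regex, so the variable name is cut at the last '=' instead of the first, and the indirect expansion of that malformed name fails with "bad substitution".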
KAFKA_OPTS: '-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'
This is what I have got after debugging.
+ for VAR in $(env)
+ [[ KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf =~ ^KAFKA_ ]]
+ [[ ! KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf =~ ^KAFKA_HOME ]]
++ echo KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf
++ sed -r 's/KAFKA_(.*)=.*/\1/g'
++ tr '[:upper:]' '[:lower:]'
++ tr _ .
+ kafka_name=opts=-djava.security.auth.login.config
++ echo KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf
++ sed -r 's/(.*)=.*/\1/g'
+ env_var=KAFKA_OPTS=-Djava.security.auth.login.config
+ grep -E -q '(^|^#)opts=-djava.security.auth.login.config=' /opt/kafka/config/server.properties
start-kafka.sh: line 96: KAFKA_OPTS=-Djava.security.auth.login.config: bad substitution
and this is the piece of code that is performing this operation.
88 for VAR in $(env)
89 do
90 if [[ $VAR =~ ^KAFKA_ && ! $VAR =~ ^KAFKA_HOME ]]; then
91 kafka_name=$(echo "$VAR" | sed -r 's/KAFKA_(.*)=.*/\1/g' | tr '[:upper:]' '[:lower:]' | tr _ .)
92 env_var=$(echo "$VAR" | sed -r 's/(.*)=.*/\1/g')
93 if grep -E -q '(^|^#)'"$kafka_name=" "$KAFKA_HOME/config/server.properties"; then
94 sed -r -i 's#(^|^#)('"$kafka_name"')=(.*)#\2='"${!env_var}"'#g' "$KAFKA_HOME/config/server.properties" #note that no config values may contain an '#' char
95 else
96 echo "$kafka_name=${!env_var}" >> "$KAFKA_HOME/config/server.properties"
97 fi
98 fi
99
100 if [[ $VAR =~ ^LOG4J_ ]]; then
101 log4j_name=$(echo "$VAR" | sed -r 's/(LOG4J_.*)=.*/\1/g' | tr '[:upper:]' '[:lower:]' | tr _ .)
102 log4j_env=$(echo "$VAR" | sed -r 's/(.*)=.*/\1/g')
103 if grep -E -q '(^|^#)'"$log4j_name=" "$KAFKA_HOME/config/log4j.properties"; then
104 sed -r -i 's#(^|^#)('"$log4j_name"')=(.*)#\2='"${!log4j_env}"'#g' "$KAFKA_HOME/config/log4j.properties" #note that no config values may contain an'#' char
105 else
106 echo "$log4j_name=${!log4j_env}" >> "$KAFKA_HOME/config/log4j.properties"
107 fi
108 fi
109 done
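The splitting step on line 92 can be reproduced in isolation. The sketch below puts the greedy sed next to a split at the first '=' done with POSIX parameter expansion. The two function names are made up for illustration, and this is not necessarily how the upstream script was later fixed.

```shell
# name_greedy: same regex as line 92 of the script; (.*) is greedy,
# so the name is cut at the LAST "=" in the string.
name_greedy() { printf '%s\n' "$1" | sed -E 's/(.*)=.*/\1/'; }

# name_first_eq: strip the longest suffix starting at the FIRST "=".
name_first_eq() { printf '%s\n' "${1%%=*}"; }

VAR='KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'
name_greedy "$VAR"     # prints KAFKA_OPTS=-Djava.security.auth.login.config
name_first_eq "$VAR"   # prints KAFKA_OPTS
```

The first result is exactly the env_var value seen in the debug trace, which is not a valid variable name and therefore cannot be expanded with ${!env_var}.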

Update: They have fixed it and it is merged now!
https://github.com/wurstmeister/kafka-docker/pull/321
There's a bug open now with wurstmeister/kafka, but they have gotten back to me with the following workaround:
I believe this is part of a larger namespace collision problem that
affects multiple elements such as Kubernetes deployments etc (as well
as other KAFKA_ service settings).
Given you are referencing an external file /kafka_server_jaas.conf,
I'm assuming you're OK adding/mounting extra files; a
work-around is to specify a CUSTOM_INIT_SCRIPT environment var, which
should be a script similar to:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.auth.login.config=/kafka_server_jaas.conf"
This is executed after the substitution part that is failing.
This could have been done inline, however there is currently a bug in
how we process the environment, where we need to specify the input
separator to make this work correctly.
Hopefully this works!

Command line grep gives wrong results

I have two files, one with the patterns of a grep search, and the other is the target file where I want to look for patterns.
File 1 looks like this:
Cre01.g001800
Cre01.g001950
g46
g46
Cre01.g002050
Cre01.g002150
RPB6
g51
Cre01.g002201
Cre01.g002201
Cre01.g002236
Cre01.g002236
Cre01.g002300
And my second file looks like this:
chromosome_12 scriptAPG exon 3691112 3693536 . + . gene_id "RPS11";transcript_id "RPS11"
chromosome_9 scriptAPG exon 3011840 3038275 . - . gene_id "Cre09.g387400";transcript_id "Cre09.g387400"
chromosome_9 scriptAPG exon 2571100 2572801 . + . gene_id "Cre09.g390678";transcript_id "Cre09.g390678"
chromosome_14 scriptAPG exon 3804470 3817534 . + . gene_id "Cre14.g632650";transcript_id "Cre14.g632650"
chromosome_3 scriptAPG exon 4400340 4417459 . + . gene_id "Cre03.g175600";transcript_id "Cre03.g175600"
scaffold_40 scriptAPG exon 36 2671 . + . gene_id "g18380";transcript_id "g18380"
chromosome_6 scriptAPG exon 7445801 7457337 . - . gene_id "Cre06.g300050";transcript_id "Cre06.g300050"
chromosome_17 scriptAPG exon 584317 595135 . + . gene_id "Cre17.g699950";transcript_id "Cre17.g699950"
My aim is to extract the rows in file2 that match some pattern in file1. So I run the following command:
grep -w -F -f ".$out."/localization/prep/translator.txt localization/prep/locations.gtf > localization/prep/genemodels.gtf
What I do not understand is why the output file of the grep search, genemodels.gtf, contains lines without matching patterns. It prints all the lines! Doing this:
grep Cre01.g003600 translator.txt
I do not obtain any results. However, doing this:
grep Cre01.g003600 genemodels.gtf
I obtain:
chromosome_1 scriptAPG exon 688996 690516 . + . gene_id "Cre01.g003600";transcript_id "Cre01.g003600"
Do you know what I am doing wrong? Could it be the dot (.) that some of the patterns (e.g. Cre01.g001950) contain? Does grep not treat it as a literal dot? How can I avoid that?
Thanks.
