I have a dataset that looks something like this:
chr1 StringTie exon 197757319 197757401 1000 + . gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "1"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr1 StringTie exon 197761802 197761965 1000 + . gene_id "MSTRG.10429"; transcript_id "ENST00000440885.1"; exon_number "2"; gene_name "RP11-448G4.4"; ref_gene_id "ENSG00000224901.1";
chr9 StringTie exon 63396911 63397070 1000 - . gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "1";
chr9 StringTie exon 63397111 63397185 1000 - . gene_id "MSTRG.145111"; transcript_id "MSTRG.145111.1"; exon_number "2";
chr21 StringTie exon 44884690 44884759 1000 + . gene_id "MSTRG.87407"; transcript_id "MSTRG.87407.1"; exon_number "1";
chr22 HAVANA exon 19667023 19667199 . + . gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "1"; gene_name "AC000067.1";
chr22 HAVANA exon 19667446 19667555 . + . gene_id "ENSG00000225007.1"; transcript_id "ENST00000452326.1"; exon_number "2"; gene_name "AC000067.1";
I want to isolate the gene_ids. Therefore, the desired output is:
MSTRG.10429
MSTRG.10429
MSTRG.145111
MSTRG.145111
MSTRG.87407
ENSG00000225007.1
ENSG00000225007.1
I've tried the following:
grep -E -o "gene_id.{0,20}" gtf_om_ENSGids_te_vinden.gtf > alle_gene_ids.txt
With this I grab the 20 characters after "gene_id", intending to strip the leftover characters afterwards (such as parts of the word "transcript"). The problem is that the ref_gene_id values are matched as well, and they do not belong in the desired output. I tried to solve this by adding the -w flag, but that does not give the right result either. Can anyone help?
Thanks!
GNU grep, using the perl regex flag:
grep -Po '(?<=\Wgene_id ")[^"]+'
POSIX sed:
sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p'
If there are multiple occurrences per line, the grep will print all of them, but the sed will print the last occurrence only.
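A minimal sketch of that difference, assuming GNU grep with -P support and a made-up line that carries two gene_id attributes (the sample data in the question only ever has one per line):

```shell
# Hypothetical line with two gene_id attributes, plus a ref_gene_id that must be ignored
line='chr1 StringTie exon 1 2 . + . gene_id "A.1"; ref_gene_id "R.9"; gene_id "B.2";'

# GNU grep -P: -o prints every match, one per line
grep_out=$(printf '%s\n' "$line" | grep -Po '(?<=\Wgene_id ")[^"]+')

# sed: the leading .* is greedy, so only the last gene_id is captured
sed_out=$(printf '%s\n' "$line" | sed -En 's/.*[^[:alnum:]_]gene_id "([^"]+).*/\1/p')

printf 'grep:\n%s\nsed:\n%s\n' "$grep_out" "$sed_out"
```

Here grep reports A.1 and B.2, while sed reports only B.2. Neither picks up R.9, because the character before gene_id in ref_gene_id is an underscore.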
Use:
grep -o -E ' gene_id \"([^"]*)\"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"| //g'
The space in ' gene_id is needed to make sure that ref_gene_id is not matched.
The sed part removes gene_id, the spaces, and the double quotes.
See: https://regex101.com/r/TDA7Cg/1
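EDIT: Because gene_id is preceded by a tab, which is not a space, change it to the command below. Note that \t inside a bracket expression is not a tab for grep -E (it matches a backslash or the letter t), so a character class is used instead:
grep -o -E '[[:space:]]gene_id "([^"]*)"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:space:]]//g'
or, to just find the start of the word, you could do
grep -o -E '\Wgene_id "([^"]*)"' gtf_om_ENSGids_te_vinden.gtf | sed -E 's/gene_id|"|[[:space:]]//g'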
But still the accepted answer is a nicer way to do it ... 😉
I have .html files in directories and subdirectories. I need to extract all strings that start with "example.com". Part of a string can look like this:
["https://example.com/folder1",
href="https://example.com/anotherfolder2" target="
etc.
What I want to extract is:
folder1
anotherfolder2
etc.
from all files in all folders into one list, one word per line.
I found some highly upvoted examples on StackOverflow, but they did not work. I tried this (adapted from those examples):
grep -Po '(?<=example.com=)[^,]*'
Thank you for help!
grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2| sed -e 's/https:\/\/example.com\///g'
grep "example.com" your-directory -r | grep -o '".*"' | cut -d \" -f2 extracts the content of each quoted string.
sed -e 's/https:\/\/example.com\///g' strips the https://example.com/ prefix, leaving only the suffix.
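The same idea can be sketched in a single grep pass, assuming a grep that supports -r (GNU and BSD grep do; it is not in POSIX). The temporary directory and file names below are made up for the demonstration:

```shell
# Hypothetical directory tree standing in for the real HTML files
dir=$(mktemp -d)
mkdir -p "$dir/sub"
printf '["https://example.com/folder1",\n' > "$dir/a.html"
printf 'href="https://example.com/anotherfolder2" target="\n' > "$dir/sub/b.html"

# -r: recurse, -h: suppress file names, -o: print only the matched part
# [^"/]+ stops at the closing quote or at the next path separator
found=$(grep -rhoE 'https://example\.com/[^"/]+' "$dir" |
  sed 's#^https://example\.com/##' | sort)

printf '%s\n' "$found"
rm -r "$dir"
```

The sort is only there to make the order predictable; drop it to keep the order in which grep visits the files.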
echo "https://example.com/folder1" | tr -s '/' | tr '/' '\n' > file
sed -i '1d' file
sed -n '1p' file # This will give you example.com
sed -n '2p' file # This will give you folder1
sed -i 1s'#example\.com#newsite.com#' file
echo "http://" > nf
sed -n '2,$p' file >> nf
cat nf | tr '\n' '/' > newfile
cat newfile # This should be http://newsite.com/folder1
rm -v ./nf
I saw this question: count (non-blank) lines-of-code in bash
I understand this pattern is correct.
grep -vc ^$ filename
Why does this pattern return the same result?
grep -c '[^ ]' filename
What is the trick in '[^ ]'?
$ printf 'foo 123\n \nxyz\n\t\n' > ip.txt
$ cat -T ip.txt
foo 123
xyz
^I
$ grep -vc '^$' ip.txt
4
$ grep -c '[^ ]' ip.txt
3
$ grep -c '[^[:blank:]]' ip.txt
2
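grep -c '[^ ]' counts every line that contains at least one character other than a space. For example, foo 123 is counted because letters are not space characters, and so is the line that holds only a tab. The line that holds only a single space (the second line of the printf input) is not counted. [^[:blank:]] excludes tabs as well, while grep -vc '^$' counts every line that is not completely empty. So which one to use depends on whether a line containing only whitespace should be counted or not.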
I'm running a sed command in a Jenkins pipeline, but it fails with the error:
sed: file - line 2: unterminated `s' command
Here is my sed code in the pipeline:
sh '''sed '/^[[:blank:]]*$/d;s|^|s/%%|;s|:|%%/|;s|$|/|' key.txt | sed -f - file1.txt > file2.txt'''
If I run only the first sed command, Jenkins gives very strange output. Here it is from the Jenkins logs, with extra lines containing stray characters:
+ sed '/^[[:blank:]]*$/d;s|^|s#%%|;s|:|%%#|;s|$|#|' keys.txt
s#%%sql_server_name%%#test_seqserver_1234
#
s#%%
#
s#%%sql_login_name%%#test_login_name
#
s#%%
#
s#%%password%%#test_password
#
s#%%
#
If I run the following command, this is what the Jenkins output looks like:
sh '''sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' keys.txt'''
Jenkins output:
+ sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' keys.txt
s/%%sql_server_name%%/test_seqserver_1234
/
s/%%sql_login_name%%/test_login_name
/
s/%%password%%/test_password
/
s/%%SID%%/123456
/
I ran the new command:
sh '''sed -e '/:/!d;s|^\\([^:]*\\):\\(.*\\)$|s/%%\\1%%/\\2/|' -e 'N;s|\\n/|/|' keys.txt'''
Here is the output:
Running shell script
+ sed -e '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' -e 'N;s|\n/|/|' keys.txt
s/%%sql_server_name%%/test_seqserver_1234
/
s/%%sql_login_name%%/test_login_name
/
s/%%password%%/test_password
/
s/%%SID%%/123456
Here is the xxd output for the text file:
Running shell script
+ xxd keys.txt
0000000: 7371 6c5f 7365 7276 6572 5f6e 616d 653a sql_server_name:
0000010: 7465 7374 5f73 6571 7365 7276 6572 5f31 test_seqserver_1
0000020: 3233 340d 0a0d 0a73 716c 5f6c 6f67 696e 234....sql_login
0000030: 5f6e 616d 653a 7465 7374 5f6c 6f67 696e _name:test_login
0000040: 5f6e 616d 650d 0a0d 0a70 6173 7377 6f72 _name....passwor
0000050: 643a 7465 7374 5f70 6173 7377 6f72 6420 d:test_password
0000060: 0d0a 0d0a 5349 443a 3132 3334 3536 200d ....SID:123456 .
0000070: 0a0d 0a64 6566 6175 6c74 5f64 6174 6162 ...default_datab
0000080: 6173 653a 7465 6d70 6462 0d0a 0d0a 6465 ase:tempdb....de
0000090: 6661 756c 745f 6c61 6e67 7561 6765 3a75 fault_language:u
00000a0: 735f 656e 676c 6973 680d 0a0d 0a63 6865 s_english....che
00000b0: 636b 5f65 7870 6972 6174 696f 6e3a 4f46 ck_expiration:OF
00000c0: 460d 0a0d 0a63 6865 636b 5f70 6f6c 6963 F....check_polic
00000d0: 793a 4f46 460d 0a0d 0a64 656c 6976 6572 y:OFF....deliver
00000e0: 7974 7970 653a 7363 6865 6475 6c65 640d ytype:scheduled.
00000f0: 0a0d 0a73 6368 6564 756c 6564 5f64 656c ...scheduled_del
0000100: 6976 6572 7964 6174 653a 3035 2d33 302d iverydate:05-30-
0000110: 3230 3939 0d0a 0d0a 7363 6865 6475 6c65 2099....schedule
0000120: 645f 6465 6c69 7665 7279 5f32 3468 725f d_delivery_24hr_
0000130: 6365 6e74 7261 6c5f 7469 6d65 3a31 3135 central_time:115
0000140: 3920 0d0a 0d0a 0d0a 0d0a 0d0a 0d0a 0d0a 9 ..............
It looks like your key.txt file has incorrect contents. Judging from the first sed command:
sed '/^[[:blank:]]*$/d; s|^|s/%%|; s|:|%%/|; s|$|/|' key.txt
it expects each line to contain a colon. It then builds the sed script for the second sed command:
sed -f - file1.txt > file2.txt
If your key.txt contains non-empty lines without a colon, you will get the error unterminated `s' command.
Ensure that key.txt is correct, or at least add /:/!d; to your pipeline, like this:
sh '''sed '/^[[:blank:]]*$/d;/:/!d;s|^|s/%%|;s|:|%%/|;s|$|/|' key.txt | sed -f - file1.txt > file2.txt'''
For example, correct key.txt contents:
username:server1
Incorrect key.txt:
username server2
There is no colon in this line, so it will cause the error.
You might try to replace your first sed command with a simpler one:
sed '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' key.txt
or better:
sed -e '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' -e 'N;s|\n/|/|' key.txt
If that doesn't help, run xxd key.txt or hexdump -C key.txt and post the output.
After you added hex contents of your key.txt file, I finally could replicate the issue on my machine. The problem could be solved by this command:
sed -e '/:/!d;s|^\([^:]*\):\(.*\)\r|s/%%\1%%/\2/|' key.txt
So the trick is to use \r instead of $ in the first sed command, because every line of key.txt ends with a carriage return (0d in the hex dump) before the newline. If it still doesn't work for you (which may happen on macOS), you can just remove the carriage returns from the key.txt file with a tool of your choice (like dos2unix), and then your original code should work.
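If dos2unix is not available, a sketch of the same clean-up with tr could look like this. The file contents are made-up stand-ins shaped like the hex dump, subst.sed is a hypothetical intermediate file used instead of sed -f -, and the approach assumes that no value contains a / character:

```shell
work=$(mktemp -d)
# Hypothetical key file with CRLF line endings and blank lines, like the xxd dump
printf 'sql_server_name:test_seqserver_1234\r\n\r\nSID:123456\r\n\r\n' > "$work/key.txt"
# Hypothetical template using the %%name%% placeholders
printf 'server=%%%%sql_server_name%%%% sid=%%%%SID%%%%\n' > "$work/file1.txt"

# Delete every carriage return, then build one s/%%key%%/value/ command per key:value line
tr -d '\r' < "$work/key.txt" |
  sed '/:/!d;s|^\([^:]*\):\(.*\)$|s/%%\1%%/\2/|' > "$work/subst.sed"

# Apply the generated script to the template
result=$(sed -f "$work/subst.sed" "$work/file1.txt")
printf '%s\n' "$result"
rm -r "$work"
```

With the carriage returns gone, $ anchors at the real end of each line again, so the pattern from the earlier suggestion works unchanged.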
I have a basic docker-compose file for wurstmeister/kafka.
I'm trying to configure it to use SASL_PLAIN with SSL.
However, I keep getting this error no matter how I try to specify my JAAS file.
This is the error I get:
[2018-04-11 10:34:34,545] FATAL [KafkaServer id=1001] Fatal error during KafkaServer startup. Prepare to shutdown (kafka.server.KafkaServer)
java.lang.IllegalArgumentException: Could not find a 'KafkaServer' or 'sasl_ssl.KafkaServer' entry in the JAAS configuration. System property 'java.security.auth.login.config' is not set
These are the vars I have. The last one is where I specify my JAAS file:
environment:
  KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
  KAFKA_HOST_NAME: 10.10.10.1
  KAFKA_PORT: 9092
  KAFKA_ADVERTISED_PORT: 9093
  KAFKA_ADVERTISED_HOST_NAME: 10.10.10.1
  KAFKA_LISTENERS: PLAINTEXT://:9092,SASL_SSL://:9093
  KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://10.10.10.1:9092,SASL_SSL://10.10.10.1:9093
  KAFKA_SECURITY_INTER_BROKER_PROTOCOL: SASL_SSL
  KAFKA_SASL_ENABLED_MECHANISMS: PLAIN
  SASL_MECHANISM_INTER_BROKER_PROTOCOL: PLAIN
  KAFKA_SASL_MECHANISM_INTER_BROKER_PROTOCOL: PLAIN
  KAFKA_SSL_TRUSTSTORE_LOCATION: /kafka.server.truststore.jks
  KAFKA_SSL_TRUSTSTORE_PASSWORD: password
  KAFKA_SSL_KEYSTORE_LOCATION: /kafka.server.keystore.jks
  KAFKA_SSL_KEYSTORE_PASSWORD: password
  KAFKA_SSL_KEY_PASSWORD: password
  KAFKA_OPTS: '-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'
Also when I try to check the docker logs I see
/usr/bin/start-kafka.sh: line 96: KAFKA_OPTS=-Djava.security.auth.login.config: bad substitution
Any help is greatly appreciated!
The equals sign '=' inside the last value is causing this issue:
KAFKA_OPTS: '-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'
This is what I have got after debugging.
+ for VAR in $(env)
+ [[ KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf =~ ^KAFKA_ ]]
+ [[ ! KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf =~ ^KAFKA_HOME ]]
++ echo KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf
++ sed -r 's/KAFKA_(.*)=.*/\1/g'
++ tr '[:upper:]' '[:lower:]'
++ tr _ .
+ kafka_name=opts=-djava.security.auth.login.config
++ echo KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf
++ sed -r 's/(.*)=.*/\1/g'
+ env_var=KAFKA_OPTS=-Djava.security.auth.login.config
+ grep -E -q '(^|^#)opts=-djava.security.auth.login.config=' /opt/kafka/config/server.properties
start-kafka.sh: line 96: KAFKA_OPTS=-Djava.security.auth.login.config: bad substitution
This is the piece of code that performs this operation (line 96 is where the substitution fails):
88 for VAR in $(env)
89 do
90   if [[ $VAR =~ ^KAFKA_ && ! $VAR =~ ^KAFKA_HOME ]]; then
91     kafka_name=$(echo "$VAR" | sed -r 's/KAFKA_(.*)=.*/\1/g' | tr '[:upper:]' '[:lower:]' | tr _ .)
92     env_var=$(echo "$VAR" | sed -r 's/(.*)=.*/\1/g')
93     if grep -E -q '(^|^#)'"$kafka_name=" "$KAFKA_HOME/config/server.properties"; then
94       sed -r -i 's#(^|^#)('"$kafka_name"')=(.*)#\2='"${!env_var}"'#g' "$KAFKA_HOME/config/server.properties" #note that no config values may contain an '#' char
95     else
96       echo "$kafka_name=${!env_var}" >> "$KAFKA_HOME/config/server.properties"
97     fi
98   fi
99
100   if [[ $VAR =~ ^LOG4J_ ]]; then
101     log4j_name=$(echo "$VAR" | sed -r 's/(LOG4J_.*)=.*/\1/g' | tr '[:upper:]' '[:lower:]' | tr _ .)
102     log4j_env=$(echo "$VAR" | sed -r 's/(.*)=.*/\1/g')
103     if grep -E -q '(^|^#)'"$log4j_name=" "$KAFKA_HOME/config/log4j.properties"; then
104       sed -r -i 's#(^|^#)('"$log4j_name"')=(.*)#\2='"${!log4j_env}"'#g' "$KAFKA_HOME/config/log4j.properties" #note that no config values may contain an '#' char
105     else
106       echo "$log4j_name=${!log4j_env}" >> "$KAFKA_HOME/config/log4j.properties"
107     fi
108   fi
109 done
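To see why the name extraction goes wrong, here is a small sketch that reproduces only that step outside the script. The variable names greedy_name, first_name and value are made up for the illustration, and the parameter-expansion variant is just one possible way to split at the first '=', not necessarily what the upstream fix does:

```shell
# Same shape as the failing variable: the value itself contains '='
VAR='KAFKA_OPTS=-Djava.security.auth.login.config=/path/kafka_server_jaas.conf'

# Greedy (.*) runs up to the LAST '=', so part of the value ends up in the name
greedy_name=$(printf '%s\n' "$VAR" | sed -E 's/(.*)=.*/\1/')

# Cutting at the FIRST '=' keeps only the variable name; the rest is the value
first_name=${VAR%%=*}
value=${VAR#*=}

printf 'greedy: %s\nfirst:  %s\nvalue:  %s\n' "$greedy_name" "$first_name" "$value"
```

The greedy result is exactly the env_var shown in the debug trace. It is not a valid variable name, which is why the indirect expansion ${!env_var} on line 96 reports a bad substitution.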
Update: They have fixed it and it is merged now!
https://github.com/wurstmeister/kafka-docker/pull/321
There's a bug open now with wurstmeister/kafka, but they have gotten back to me with the following workaround:
I believe this is part of a larger namespace collision problem that
affects multiple elements such as Kubernetes deployments etc (as well
as other KAFKA_ service settings).
Given you are referencing an external file /kafka_server_jaas.conf,
i'm assuming you're OK adding/mounting extra files through; a
work-around is to specify a CUSTOM_INIT_SCRIPT environment var, which
should be a script similar to:
#!/bin/bash
export KAFKA_OPTS="-Djava.security.auth.login.config=/kafka_server_jaas.conf"
This is executed after the substitution part that is failing.
This could have been done inline, however there is currently a bug in
how we process the environment, where we need to specify the input
separator to make this work correctly.
Hopefully this works!
I have two files, one with the patterns of a grep search, and the other is the target file where I want to look for patterns.
File 1 looks like this:
Cre01.g001800
Cre01.g001950
g46
g46
Cre01.g002050
Cre01.g002150
RPB6
g51
Cre01.g002201
Cre01.g002201
Cre01.g002236
Cre01.g002236
Cre01.g002300
And my second file looks like this:
chromosome_12 scriptAPG exon 3691112 3693536 . + . gene_id "RPS11";transcript_id "RPS11"
chromosome_9 scriptAPG exon 3011840 3038275 . - . gene_id "Cre09.g387400";transcript_id "Cre09.g387400"
chromosome_9 scriptAPG exon 2571100 2572801 . + . gene_id "Cre09.g390678";transcript_id "Cre09.g390678"
chromosome_14 scriptAPG exon 3804470 3817534 . + . gene_id "Cre14.g632650";transcript_id "Cre14.g632650"
chromosome_3 scriptAPG exon 4400340 4417459 . + . gene_id "Cre03.g175600";transcript_id "Cre03.g175600"
scaffold_40 scriptAPG exon 36 2671 . + . gene_id "g18380";transcript_id "g18380"
chromosome_6 scriptAPG exon 7445801 7457337 . - . gene_id "Cre06.g300050";transcript_id "Cre06.g300050"
chromosome_17 scriptAPG exon 584317 595135 . + . gene_id "Cre17.g699950";transcript_id "Cre17.g699950"
My aim is to extract the rows in file2 that contain any of the patterns in file1. So, I run the following command:
grep -w -F -f ".$out."/localization/prep/translator.txt localization/prep/locations.gtf > localization/prep/genemodels.gtf
What I do not understand is why the output file of the grep search, genemodels.gtf, contains lines that match none of the patterns. It prints all the lines! Doing this:
grep Cre01.g003600 translator.txt
I do not obtain any results. However, doing this:
grep Cre01.g003600 genemodels.gtf
I obtain:
chromosome_1 scriptAPG exon 688996 690516 . + . gene_id "Cre01.g003600";transcript_id "Cre01.g003600"
Do you know what I am doing wrong? Could it be the dot (.) that some of the patterns contain (e.g. Cre01.g001950)? I suspect grep does not treat it as a literal dot. How can I avoid that?
Thanks.
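On the literal-dot part of the question, a small sketch with made-up data suggests that -F already takes care of it: with -F every pattern is a fixed string, so the dot only matches a dot. The file names and the second, deliberately wrong ID are assumptions for the demonstration, and this does not by itself explain why every line ends up in the output:

```shell
work=$(mktemp -d)
# Hypothetical pattern file and target file, using IDs in the same style
printf 'Cre01.g001950\n' > "$work/patterns.txt"
printf 'gene_id "Cre01.g001950"\ngene_id "Cre01Xg001950"\n' > "$work/target.gtf"

# -F: patterns are fixed strings, so '.' only matches a literal dot
fixed_count=$(grep -c -w -F -f "$work/patterns.txt" "$work/target.gtf")
# Without -F the pattern is a regular expression and '.' matches any character
regex_count=$(grep -c -w -f "$work/patterns.txt" "$work/target.gtf")

printf 'fixed: %s, regex: %s\n' "$fixed_count" "$regex_count"
rm -r "$work"
```

Only the run without -F also matches the ID with an X in place of the dot.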