tr on mac: misplaced sequence asterisk - tr

I'm very new to all of this so please excuse any mistakes.
I'm working on on a mac.
I'm trying to follow this tutorial here
When I type in tr "[ -%,;\(\):=\.\\\*[]\"\']" "_" < hug_tol.fasta > hug_tol.clean.fasta
I get the error message tr:misplaced sequence asterisk
I'm guessing that something in the file must be wrong, but since I'm trying to remove those characters the error message doesn't make sense.
I haven't found anything on Google so maybe someone can help me.

The author of the tutorial appears to be using quasi-regex character class syntax for tr. tr is much more limited in it's scope than that. It only accepts a few escape characters and special characters. Simplify your command to
tr "%,;():=.*[]\"\' \\\\\-" "_" < hug_tol.fasta > hug_tol.clean.fasta
The - character does have special meaning, so put it at the end: in the beginning it will be interpreted as a command-line argument, while in the middle it specifies a character range. In bash, * won't be expanded in double quotes. For tr, to specify a plain \, you need a double \ (since it's the escape character). To get that through bash, you need \\\\.
You may also want to consider using the -c option to specify the complement set (the characters you want to keep), since it is probably much smaller:
tr -c "A-Za-z0-9_" "_" < hug_tol.fasta > hug_tol.clean.fasta
or more tersely
tr -c "[:alnum:]" "_" < hug_tol.fasta > hug_tol.clean.fasta

Related

select only a word that is part of colon

I have a text file using markup language (similar to wikipedia articles)
cat test.txt
This is a sample text having: colon in the text. and there is more [[in single or double: brackets]]. I need to select the first word only.
and second line with no [brackets] colon in it.
I need to select the word "having:" only because that is part of regular text. I tried
grep -v '[*:*]' test.txt
This will correctly avoid the tags, but does not select the expected word.
The square brackets specify a character class, so your regular expression looks for any occurrence of one of the characters * or : (or *, but we said that already, didn't we?)
grep has the option -o to only print the matching text, so something lie
grep -ow '[^[:space:]]*:[^[:space:]]*' file.txt
would extract any text with a colon in it, surrounded by zero or more non-whitespace characters on each side. The -w option adds the condition that the match needs to be between word boundaries.
However, if you want to restrict in which context you want to match the text, you will probably need to switch to a more capable tool than plain grep. For example, you could use sed to preprocess each line to remove any bracketed text, and then look for matches in the remaining text.
sed -e 's/\[.*]//g' -e 's/ [^: ]*$/ /' -e 's/[^: ]* //g' -e 's/ /\n/' file.txt
(This assumes that your sed recognizes \n in the replacement string as a literal newline. There are simple workarounds available if it doesn't, but let's not go there if it's not necessary.)
In brief, we first replace any text between square brackets. (This needs to be improved if your input could contain multiple sequences of square brackets on a line with normal text between them. Your example only shows nested square brackets, but my approach is probably too simple for either case.) Then, we remove any words which don't contain a colon, with a special provision for the last word on the line, and some subsequent cleanup. Finally, we replace any remaining spaces with newlines, and (implicitly) print whatever is left. (This still ends up printing one newline too many, but that is easy to fix up later.)
Alternatively, we could use sed to remove any bracketed expressions, then use grep on the remaining tokens.
sed -e :a -e 's/\[[^][]*\]//' -e ta file.txt |
grep -ow '[^[:space:]]*:[^[:space:]]*'
The :a creates a label a and ta says to jump back to that label and try again if the regex matched. This one also demonstrates how to handle nested and repeated brackets. (I suppose it could be refactored into the previous attempt, so we could avoid the pipe to grep. But outlining different solution models is also useful here, I suppose.)
If you wanted to ensure that there is at least one non-colon character adjacent to the colon, you could do something like
... file.txt |
grep -owE '[^:[:space:]]+:[^[:space:]]*|[^[:space:]]*:[^: [:space:]]+'
where the -E option selects a slightly more modern regex dialect which allows us to use | between alternatives and + for one or more repetitions. (Basic grep in 1969 did not have these features at all; much later, the POSIX standard grafted them on with a slightly wacky syntax which requires you to backslash them to remove the literal meaning and select the metacharacter behavior... but let's not go there.)
Notice also how [^:[:space:]] matches a single character which is not a colon or a whitespace character, where [:space:] is the (slightly arcane) special POSIX named character class which matches any whitespace character (regular space, horizontal tab, vertical tab, possibly Unicode whitespace characters, depending on locale).
Awk easily lets you iterate over the tokens on a line. The requirement to ignore matches within square brackets complicates matters somewhat; you could keep a separate variable to keep track of whether you are inside brackets or not.
awk '{ for(i=1; i<=NF; ++i) {
if($i ~ /\]/) { brackets=0; next }
if($i ~ /\[/) brackets=1;
if(brackets) next;
if($i ~ /:/) print $i }' file.txt
This again hard-codes some perhaps incorrect assumptions about how the brackets can be placed. It will behave unexpectedly if a single token contains a closing square bracket followed by an opening one, and has an oversimplified treatment of nested brackets (the first closing bracket after a series of opening brackets will effectively assume we are no longer inside brackets).
A combined solution using sed and awk:
sed 's/ /\n/g' test.txt | gawk 'i==0 && $0~/:$/{ print $0 }/\[/{ i++} /\]/ {i--}'
sed will change all spaces to a newline
awk (or gawk) will output all lines matching $0~/:$/, as long as i equals zero
The last part of the awk stuff keeps a count of the opening and closing brackets.
Another solution using sed and grep:
sed -r -e 's/\[.*\]+//g' -e 's/ /\n/g' test.txt | grep ':$'
's/\[.*\]+//g' will filter the stuff between brackets
's/ /\n/g' will replace a space with a newline
grep will only find lines ending with :
A third on using only awk:
gawk '{ for (t=1;t<=NF;t++){
if(i==0 && $t~/:$/) print $t;
i=i+gsub(/\[/,"",$t)-gsub(/\]/,"",$t) }}' test.txt
gsub returns the number of replacements.
The variable i is used to count the level of brackets. On every [ it is incremented by 1, and on every ] it is decremented by one. This is done because gsub(/\[/,"",$t) returns the number of replaced characters. When having a token like [[][ the count is increased by (3-1=) 2. When a token has brackets AND a semicolon my code will fail, because the token will match, if it ends with a :, before the count of the brackets.

Double \\ in regular expression iOS

Does anyone understand what this (([A-Za-z\\s])+)\\? means?
I wonder why it should be "\\s" and "\\" ?
If I entered "\s", Xcode just doesn't understand and if I entered "\?", it just doesn't match the "?".
I have googled a lot, but I did not find a solution. Anyone knows?
The actual regex is (([A-Za-z\s])+)\?. This matches one or more letters and whitespace characters followed by an question mark. The \ has two different meanings here. In the first instance \s has a fixed meaning and stands for any white space characters. In the second instance the \? means the literal question mark character. The escaping is necessary as the question mark means one or none of the previous otherwise.
You can't type your regex like this in a string literal in C code though. C also does some escaping using the backslash character. For example "\n" is translated to a string containing only a newline character. There are some other escape sequences with special meanings. If the character after the backslash doesn't have a special meaning the backslash is just removed. That means if you want to have a single backspace in your string you have to write two.
So if you wrote your regex string as you wanted you'd get different results as it would be interpreted as (([A-Za-zs])+)? which has a completely different meaning. So when you write a regex in an ObjC (or any other C-based language) string literal you must double all backslash characters.
not sure about ios but same thing happens in java. \ is escape character for java,and c also so when you type \s java reads \ as an escape character.
think of it as if you want to print a \ what will you have to do.
you will have to type \\. now first \ will work as escape character for java and second one will be printed.
I think it should be the same concept for ios too.
so if you want \s you type \s, if you want \ you type \\.
The \s metacharacter is used to find a whitespace character.
Refer this!

How to define a ruby array that contains a backslash("\") character?

I want to define an array in ruby in following manner
A = ["\"]
I am stuck here for hours now. Tried several possible combinations of single and double quotes, forward and backward slashes. Alas !!
I have seen this link as well : here
But couldn't understand how to resolve my problem.
Apart from this what I need to do is -
1. Read a file character by character (which I managed to do !)
2. This file contains a "\" character
3. I want to do something if my array A includes this backslash
A.includes?("\")
Any help appreciated !
There are some characters which are special and need to be escaped.
Like when you define a string
str = " this is test string \
and this contains multiline data \
do you understand the backslash meaning here \
it is being used to denote the continuation of line"
In a string defined in a double quotes "", if you need to have a double quote how would you doo that? "\"", this is why when you put a backslash in a string you are telling interpretor you are going to use some special characters and which are escaped by backslash. So when you read a "\" from a file it will be read as "\" this into a ruby string.
char = "\\"
char.length # => 1
I hope this helps ;)
Your issue is not with Array, your question really involves escape sequences for special characters in strings. As the \ character is special, you need to first prepend it (escape it) with a leading backslash, like so.
"\\"
You should also re-read your link and the section on escape sequences.
You can escape backslash with a backslash in double quotes like:
["\\"].include?("\\")

Grep for beginning and end of line?

I have a file where I want to grep for lines that start with either -rwx or drwx AND end in any number.
I've got this, but it isnt quite right. Any ideas?
grep [^.rwx]*[0-9] usrLog.txt
The tricky part is a regex that includes a dash as one of the valid characters in a character class. The dash has to come immediately after the start for a (normal) character class and immediately after the caret for a negated character class. If you need a close square bracket too, then you need the close square bracket followed by the dash. Mercifully, you only need dash, hence the notation chosen.
grep '^[-d]rwx.*[0-9]$' "$#"
See: Regular Expressions and grep for POSIX-standard details.
It looks like you were on the right track... The ^ character matches beginning-of-line, and $ matches end-of-line. Jonathan's pattern will work for you... just wanted to give you the explanation behind it
It should be noted that not only will the caret (^) behave differently within the brackets, it will have the opposite result of placing it outside of the brackets. Placing the caret where you have it will search for all strings NOT beginning with the content you placed within the brackets. You also would want to place a period before the asterisk in between your brackets as with grep, it also acts as a "wildcard".
grep ^[.rwx].*[0-9]$
This should work for you, I noticed that some posters used a character class in their expressions which is an effective method as well, but you were not using any in your original expression so I am trying to get one as close to yours as possible explaining every minor change along the way so that it is better understood. How can we learn otherwise?
You probably want egrep. Try:
egrep '^[d-]rwx.*[0-9]$' usrLog.txt
are you parsing output of ls -l?
If you are, and you just want to get the file name
find . -iname "*[0-9]"
If you have no choice because usrLog.txt is created by something/someone else and you absolutely must use this file, other options include
awk '/^[-d].*[0-9]$/' file
Ruby(1.9+)
ruby -ne 'print if /^[-d].*[0-9]$/' file
Bash
while read -r line ; do case $line in [-d]*[0-9] ) echo $line; esac; done < file
Many answers provided for this question. Just wanted to add one more which uses bashism-
#! /bin/bash
while read -r || [[ -n "$REPLY" ]]; do
[[ "$REPLY" =~ ^(-rwx|drwx).*[[:digit:]]+$ ]] && echo "Got one -> $REPLY"
done <"$1"
#kurumi answer for bash, which uses case is also correct but it will not read last line of file if there is no newline sequence at the end(Just save the file without pressing 'Enter/Return' at the last line).

Regular expression in Ruby

Could anybody help me make a proper regular expression from a bunch of text in Ruby. I tried a lot but I don't know how to handle variable length titles.
The string will be of format <sometext>title:"<actual_title>"<sometext>. I want to extract actual_title from this string.
I tried /title:"."/ but it doesnt find any matches as it expects a closing quotation after one variable from opening quotation. I couldn't figure how to make it check for variable length of string. Any help is appreciated. Thanks.
. matches any single character. Putting + after a character will match one or more of those characters. So .+ will match one or more characters of any sort. Also, you should put a question mark after it so that it matches the first closing-quotation mark it comes across. So:
/title:"(.+?)"/
The parentheses are necessary if you want to extract the title text that it matched out of there.
/title:"([^"]*)"/
The parentheses create a capturing group. Inside is first a character class. The ^ means it's negated, so it matches any character that's not a ". The * means 0 or more. You can change it to one or more by using + instead of *.
I like /title:"(.+?)"/ because of it's use of lazy matching to stop the .+ consuming all text until the last " on the line is found.
It won't work if the string wraps lines or includes escaped quotes.
In programming languages where you want to be able to include the string deliminator inside a string you usually provide an 'escape' character or sequence.
If your escape character was \ then you could write something like this...
/title:"((?:\\"|[^"])+)"/
This is a railroad diagram. Railroad diagrams show you what order things are parsed... imagine you are a train starting at the left. You consume title:" then \" if you can.. if you can't then you consume not a ". The > means this path is preferred... so you try to loop... if you can't you have to consume a '"' to finish.
I made this with https://regexper.com/#%2Ftitle%3A%22((%3F%3A%5C%5C%22%7C%5B%5E%22%5D)%2B)%22%2F
but there is now a plugin for Atom text editor too that does this.

Resources