What's the difference between \r\n and \n - character-encoding

In the terminal there seems to be no difference between the two:
echo -en 'first\r\nsecond' and echo -en 'first\nsecond'
but in real use the version without \r doesn't work:
echo -en 'GET /test HTTP/1.1\r\nHost: localhost\r\n\r\n' | nc localhost 9292
works, but
echo -en 'GET /test HTTP/1.1\r\nHost: localhost\n\n' | nc localhost 9292
doesn't
Can anyone explain why?

Some applications can handle both \r\n (a.k.a. CRLF, carriage return line feed) and \n (a.k.a. LF, line feed) equivalently as newline sequences. Your terminal is an example.
The HTTP/1.1 specification defines CRLF as the line terminator for the start line and the header fields. It allows a recipient to accept a bare LF, but does not require it. So a strict HTTP server (such as the one you're running on localhost:9292) will not interpret LF by itself as a valid header line terminator.
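To see the difference for yourself, here is a minimal sketch that dumps the raw bytes with od and counts the CR bytes in the request. The variable names (crlf_dump, lf_dump, cr_count) are only illustrative, and nothing here contacts the server.

```shell
# od -c prints control characters as escapes, so \r and \n become visible.
crlf_dump=$(printf 'first\r\nsecond' | od -An -c)
lf_dump=$(printf 'first\nsecond' | od -An -c)
printf 'CRLF: %s\nLF:   %s\n' "$crlf_dump" "$lf_dump"

# Count the carriage returns in the request before piping it to nc:
# tr -cd '\r' deletes every byte except CR, wc -c counts what is left.
cr_count=$(printf 'GET /test HTTP/1.1\r\nHost: localhost\r\n\r\n' | tr -cd '\r' | wc -c | tr -d ' ')
echo "CR bytes in request: $cr_count"
```

Using printf rather than echo -en also avoids differences between echo implementations.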

how to avoid lookbehind assertion is not fixed length

I have a file that contains a version number that I need to output. This version number is part of a string in this file that looks something like this:
https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
I need to get the version number, which is 1.2.345.
This grep command works: grep -Po '(?<=/name-of-file_CXP123456-/)\d.\d.\d\d\d'. However, the CXP number changes and as such I thought I could do something like this: grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' but that gives the following:
grep: lookbehind assertion is not fixed length
Is there anything I can add to the grep statement to avoid this?
Ultimately, this is part of a stage in Jenkins to get this version number. The sh command looks something like this:
VERSION = sh 'ssh -tt user@ip-address "cat dir/file*.content | grep -Po '(?<=/name-of-file_*-/)\d.\d.\d\d\d' 1>&2"'
You can use
grep -Po '/name-of-file_.*-\K\d+(?:\.\d+)+'
See the regex demo. Details:
/name-of-file_ - a literal text
.* - any zero or more chars other than line break chars as many as possible
- - a hyphen
\K - a match reset operator that omits all text matched so far from the memory buffer
\d+ - one or more digits
(?:\.\d+)+ - one or more sequences of a . and one or more digits.
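As a quick check, the pattern can be run against the sample URL from the question. This is only a sketch: it assumes GNU grep built with PCRE support (-P), and the url and version variable names are made up for the example.

```shell
# Sample line from the question (trailing quote omitted).
url='https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar'

# -P enables PCRE, -o prints only the matched part; \K drops everything
# matched before it, so only the dotted number is printed.
version=$(printf '%s\n' "$url" | grep -Po '/name-of-file_.*-\K\d+(?:\.\d+)+')
echo "$version"
```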
You don't need lookbehind for this job. You also don't need PCREs, or grep at all.
#!/usr/bin/env bash
# ^^^^- bash, *not* sh
case $BASH_VERSION in '') echo "ERROR: bash required" >&2; exit 1;; esac
string="https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar"
regex='.*/name-of-file_CXP[[:digit:]]+-([[:digit:].]+)[.]jar'
if [[ $string =~ $regex ]]; then
  echo "Version is ${BASH_REMATCH[1]}"
else
  echo "No version found in $string"
fi
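If bash is not available either, the same extraction can be sketched with plain POSIX parameter expansion. This assumes the version is whatever sits between the last hyphen and the .jar suffix, which holds for the sample string but is an assumption about the real data.

```shell
string='https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar'

# ${var##*-} removes the longest prefix ending in a hyphen,
# leaving "1.2.345.jar".
tail=${string##*-}

# ${var%.jar} removes the ".jar" suffix.
version=${tail%.jar}

echo "Version is $version"
```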
Maybe too long for a comment... It looks like the version number is the second-to-last field if you split on forward slashes?
rev | cut -d/ -f 2 | rev
awk -F/ '{print $(NF-1)}'
perl -lanF/ -e 'print $F[-2]'
Or even something like: basename "$(dirname "$(cat filename)")"
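A small sketch of the field-splitting idea, run on the sample URL held in a variable instead of a file (the url variable is an assumption for the example). Both forms rely on the version being the second-to-last path component.

```shell
url='https://some-link:1234/path/to/file/name-of-file/1.2.345/name-of-file_CXP123456-1.2.345.jar'

# Split on "/" and print the field before the last one.
via_awk=$(printf '%s\n' "$url" | awk -F/ '{ print $(NF-1) }')

# dirname strips the file name, basename then keeps the last directory.
via_dirname=$(basename "$(dirname "$url")")

echo "$via_awk $via_dirname"
```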
For those who are really desperate, there is another solution, which requires you to pre-build your regex string.
It's not a solution I would recommend, but if there is really no other way, no one can stop you.
Even with this you won't have true dynamic look-behinds, and it is still quite limited, but it is an option available to you.
The idea is to build one look-behind for each possible length you need.
For example, to match only if the text is not preceded by a # (look-behinds of 0 to 100 characters):
reg='';
for ((i = 0 ; i <= 100 ; i++)); do reg+='(?<!#.{'"${i}"'})'; done;
reg+='someVariableName=.*?($|;|\\n)';
grep --perl-regexp "$reg" /usr/local/mgmsbox/msc/scripts/msc.cfg
This might not be the best example, but it gets the idea across.
This solution has its own pitfalls. For example, you need to double-escape sequences like \n (writing \\n), and any character that the shell should not interpret should be put in a single-quoted string (or use printf).
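Here is a scaled-down sketch of the same idea that can be run without the config file. It assumes GNU grep with -P support, uses a POSIX while loop instead of the bash for loop, limits the look-behind to 10 characters, and works on an invented three-line sample (the config variable).

```shell
# Build (?<!#.{0})(?<!#.{1})...(?<!#.{10}): one fixed-length negative
# look-behind per possible distance between the # and the match.
reg=''
i=0
while [ "$i" -le 10 ]; do
  reg="${reg}(?<!#.{${i}})"
  i=$((i + 1))
done
reg="${reg}someVariableName="

# Hypothetical input: only the first line is not commented out.
config='someVariableName=1
# someVariableName=2
foo; #x someVariableName=3'

active=$(printf '%s\n' "$config" | grep -P "$reg")
echo "$active"
```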

Regex for line containing one or more spaces or dashes

I have a .txt file with city names, one per line. Some of them consist of several words separated by one or more spaces, or of words connected with '-'. I need a bash command that prints those lines. Currently I'm using cat piped into grep, but I can't get both spaces and dashes into one search, and I had problems checking for multiple spaces.
print lines with dash:
cat file.txt | grep ".*-.*"
print lines with spaces:
cat file.txt | grep ".*\s.*"
Though when I try:
cat file.txt | grep ".*\s+.*"
I get nothing.
Thanks for the help.
Something like this should work:
grep -E -- ' |-' file.txt
Explanation:
-E: to interpret patterns as extended regular expressions
--: to signify the end of command options
' |-': the line contains either a space or a dash (a dash needs no backslash outside a bracket expression)
This does not directly address your question, but is too much to put in a comment.
You don't need the .* in your patterns. .* at the beginning or end of a pattern is useless, because it means "0 or more of any character" and so will always match.
These commands are all equivalent (the -- keeps grep from parsing a pattern that starts with - as options):
cat file.txt | grep ".*-.*"
cat file.txt | grep -- "-.*"
cat file.txt | grep "-"
Plus you don't need to cat and pipe:
grep "-" file.txt
When grep pattern matches, the default action is to print the whole line, so .* in all your patterns are redundant, you may delete them. Also, you don't have to use cat file | as you may specify the file to grep directly after pattern, i.e. grep 'pattern' file.txt.
Here are some more details:
grep ".*-.*" = grep -- "-" - returns any lines having a - char (-- signals the end of options, the next thing is the pattern)
grep ".*\s.*" = grep "\s" - matches and returns lines containing a whitespace char (only GNU grep)
grep ".*\s+.*" = grep "\s+" - returns lines containing a whitespace char followed by a literal + char (since this is a POSIX BRE regex, the unescaped + matches a literal plus sign).
You want
grep "[[:space:]-]" file.txt
See the online demo:
#!/bin/bash
s='abc - def
ghi
jkl mno'
grep '[[:space:]-]' <<< "$s"
Output:
abc - def
jkl mno
The [[:space:]-] POSIX BRE and ERE (enabled with -E option) compliant pattern matches either any whitespace (with the [:space:] POSIX character class) or a hyphen.
Note that [\s-] won't work since \s inside a bracket expression is not treated as a regex escape sequence but as a mere \ or s.
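To make the BRE versus ERE difference concrete, here is a small sketch on an invented list of cities (the cities variable and the counts are for this sample only). It also shows one way to ask for two or more consecutive spaces, using an ERE interval.

```shell
cities='New York
Paris
Baden-Baden
Los  Angeles'

# grep -c exits non-zero when the count is 0, hence the "|| true".
# BRE: + is a literal plus sign, so nothing matches.
bre_plus=$(printf '%s\n' "$cities" | grep -c '[[:space:]]+' || true)

# ERE (-E): + means "one or more", so both lines with spaces match.
ere_plus=$(printf '%s\n' "$cities" | grep -Ec '[[:space:]]+' || true)

# ERE interval: two or more whitespace characters in a row.
double_space=$(printf '%s\n' "$cities" | grep -E '[[:space:]]{2,}' || true)

# Whitespace or hyphen, valid in both BRE and ERE.
space_or_dash=$(printf '%s\n' "$cities" | grep -c '[[:space:]-]' || true)

echo "$bre_plus $ere_plus $space_or_dash"
echo "$double_space"
```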

How to grep a text in a file with new/breaks line

I have to parse the content of multiple files with this content:
style=3D""><a href=3D"https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ" style=3D"color:#3b599
I have to extract the https link, but my grep command can't ignore the line breaks and ends up with a truncated result:
COMMAND
grep -r -m1 -oh "https://123456789.com/accounts/confirm_email*\s*[^ ]*" /folder/
RESULT
https://123456789.com/accounts/confirm_email/19AbCDx=
DESIRED RESULT
https://123456789.com/accounts/confirm_email/19AbCDx=K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1MjkwODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ
PS: the '=' character is not (always) part of the link; it is how the file format marks a line break.
NB: https://123456789.com/accounts/confirm_email/ is the only constant of the link repeated in all files.
If I add the -z option, the -m1 option is ignored and the result is:
https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ"https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ"https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ"
If I add | head -3 after the command it seems to work, BUT https is repeated in the last line:
COMMAND
grep -r -oh -z "https://123456789.com/accounts/confirm_email*\s*[^ ]*" /folder/ | head -3
https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ"https://123456789.com/accounts/confirm_email/19AbCDx=
How can I exclude it?
man grep:
-z, --null-data
Treat the input as a set of lines, each terminated by a zero
byte (the ASCII NUL character) instead of a newline.
So:
$ grep -z -r -m1 -oh "https://123456789.com/accounts/confirm_email*\s*[^ ]*" file
Output:
https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ"
The newlines will still be there but you could delete them with tr -d \\n
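A different way to get a single-line result, sketched here as an alternative and not the -z approach itself: delete the newlines first with tr, then let grep -o stop at the closing double quote. The sample variable holds the three lines from the question; the '=' characters at the former line breaks are left in place, so this does not reproduce the desired result exactly.

```shell
sample='style=3D""><a href=3D"https://123456789.com/accounts/confirm_email/19AbCDx=
K/bWFyY29A1234529zYW50dWNjaS5ldQ/?app_redirect=3DFalse&ndid=3DHMTU1Mjk=
wODY5OTA1MDk2NTptYXJjb0BtYXJjb3NhbnR1Y2NpLmV1Ojg1OQ" style=3D"color:#3b599'

# tr -d '\n' joins the lines; [^" ]* stops at the first quote or space;
# head -n 1 keeps only the first link found.
link=$(printf '%s\n' "$sample" \
  | tr -d '\n' \
  | grep -o 'https://123456789\.com/accounts/confirm_email[^" ]*' \
  | head -n 1)

echo "$link"
```

For real files, replace the printf with something like tr -d '\n' < file, one file at a time, since joining lines across files would merge unrelated content.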

Unix md5sum vs Powershell Get-hash

I am trying to generate md5 hash from Powershell. I installed Powershell Community Extension (Pscx) to get command : Get-Hash
However when I generate md5 hash using Get-Hash, it doesn't seem to match the hash generated using md5sum on an Ubuntu machine.
Powershell:
PS U:\> "hello world" | get-hash -Algorithm MD5
Path Algorithm HashString Hash
---- --------- ---------- ----
MD5 E42B054623B3799CB71F0883900F2764 {228, 43, 5, 70...}
Ubuntu:
root@LT-A03433:~# echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
I know that the one generated by Ubuntu is correct as a couple of online sites show the same result.
What am I doing wrong with PowerShell Get-Hash?
The difference is not obvious, but you are not hashing the same data. MD5 is a hashing algorithm, and it has no notion of text encoding – this is why you can create a hash of binary data just as easily as a hash of text. With that in mind, we can find out what bytes (or octets; strictly a stream of values of 8 bits each) MD5 is calculating the hash of. For this, we can use xxd, or any other hexeditor.
First, your Ubuntu example:
$ echo "hello world" | xxd
0000000: 6865 6c6c 6f20 776f 726c 640a hello world.
Note the 0a, a Unix-style newline, at the end, displayed as . in the right-hand view. By default echo appends a newline to what it prints; you could use printf instead, but that would lead to a different hash.
$ echo "hello world" | md5
6f5902ac237024bdd0c176cb93063dc4
Now let's consider what PowerShell is doing. It is passing a string of its own directly to the get-hash cmdlet. As it turns out, the natural representation of string data in a lot of Windows is not the same as for Unix – Windows uses wide strings, where each character is represented (in memory) as two bytes. More specifically, we can open a text editor, paste in:
hello world
With no trailing newline, and save it as UTF-16, little-endian. If we examine the actual bytes this produces, we see the difference:
$ xxd < test.txt
0000000: 6800 6500 6c00 6c00 6f00 2000 7700 6f00 h.e.l.l.o. .w.o.
0000010: 7200 6c00 6400 r.l.d.
Each character now takes two bytes, with the second byte being 00 – this is normal (and is the reason why UTF-8 is used across the Internet instead of UTF-16, for example), since the Unicode codepoints for basic ASCII characters are the same as their ASCII representation. Now let's see the hash:
$ md5 < test.txt
e42b054623b3799cb71f0883900f2764
Which matches what PS is producing for you.
So, to answer your question – you're not doing anything wrong. You just need to encode your string the same way to get the same hash. Unfortunately I don't have access to PS, but this should be a step in the right direction: UTF8Encoding class.
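The byte counts behind this can be checked from a Unix shell. This is a minimal sketch that assumes iconv is installed; the variable names are illustrative only.

```shell
# printf adds nothing: 11 bytes of ASCII.
plain=$(printf 'hello world' | wc -c | tr -d ' ')

# echo appends a newline (0a): 12 bytes.
with_newline=$(echo 'hello world' | wc -c | tr -d ' ')

# UTF-16LE uses two bytes per character here: 22 bytes, no newline.
utf16=$(printf 'hello world' | iconv -f UTF-8 -t UTF-16LE | wc -c | tr -d ' ')

echo "$plain $with_newline $utf16"
```

Three different byte sequences necessarily give three different MD5 hashes.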
This question is surely related to How to get an MD5 checksum in PowerShell, but it’s different and makes an important point.
Md5sums are computed from bytes. In fact, your Ubuntu result is, in a sense, wrong:
$ echo "hello world" | md5sum
6f5902ac237024bdd0c176cb93063dc4 -
$ echo -n "hello world" | md5sum
5eb63bbbe01eeed093cb22bb8f5acdc3 -
In the first case you sum the 11 bytes which make up the ASCII representation of your string, plus a final newline (line feed), 12 bytes in total. In the second case, you don’t include the newline.
(As an aside, it is interesting to note that a here string includes a trailing newline:)
$ md5sum <<<"hello world"
6f5902ac237024bdd0c176cb93063dc4 -
In Windows powershell, your string is represented in UTF-16LE, 2 bytes per character. To get the same result in Ubuntu and in Windows, you have to use a recoding program. A good choice for Ubuntu is iconv:
$ echo -n "hello world" | iconv -f UTF-8 -t UTF-16LE | md5sum
e42b054623b3799cb71f0883900f2764 -
The md5sum result is wrong-ish, in spite of other people agreeing with it. The hashed input includes a platform-specific end-of-line sequence that the shell appends to the string: an LF on Unix, a CR-LF on Windows.
Verify this on a machine with powershell and bash and e.g. postgres installed for comparison:
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
echo 'A string with no CR or LF at the end' | md5sum.exe
'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
Output first two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(13) || Chr(10) )" }
md5
----------------------------------
1b16276b75aba6ebb88512b957d2a198
PS> echo 'A string with no CR or LF at the end' | md5sum.exe
1b16276b75aba6ebb88512b957d2a198 *-
Output second two lines:
PS> 'A string with no CR or LF at the end' | %{ psql -c "select md5('$_' || Chr(10) )" }
md5
----------------------------------
68a1fcb16b4cc10bce98c5f48df427d4
PS> bash -c "echo 'A string with no CR or LF at the end' | md5sum.exe"
68a1fcb16b4cc10bce98c5f48df427d4 *-

grep is unable to find all pattern matching "\[\[\[\["

I am having problems with using grep along with a pipe. The scenario is as follows:
I am running a Python script that prints debug messages to the screen. I use ./prog | grep "\[\[\[\[" to catch the strings with "[[[[" in them. It returns a few matching results but not others (another observation: the results found by grep come before the results not found by grep in the file). I have run ./prog without the pipe and grep, and it outputs all the strings with the "[[[[" pattern.
The problem is that the left square bracket is a special character in regular expressions. "grep" is not just a string matcher. Regular expressions are an involved language that let you describe patterns of text. Grep is trying to interpret [[[[ as a regular expression, not just a string.
As your question subject suggests, you can usually escape special characters with a backslash. So the following might work:
./prog | grep '\[\[\[\['
You can also "escape" square brackets by putting them inside square brackets. Thus, [[][[][[][[] or [[]{4} if your version of grep handles it.
You also need to determine whether your program, ./prog, is sending output to "standard output" or "standard error". You can put all your stderr through the pipe with:
./prog 2>&1 | egrep '[[]{4}'
UPDATE:
[ghoti@pc ~]$ printf '[[[[\n[[[\n[[[[\n[[[[[\n[[\n' | grep '\[\[\[\['
[[[[
[[[[
[[[[[
[ghoti@pc ~]$ printf '[[[[\n[[[\n[[[[\n[[[[[\n[[\n' | egrep '[[]{4}'
[[[[
[[[[
[[[[[
[ghoti@pc ~]$
Obviously, my results do not match yours. If you can provide more details as to the data you're processing, it will be helpful in trying to duplicate your results.
Error messages are usually sent to stderr, not stdout; your pipe is filtering stdout. (Your "another observation" hints at this.) You can redirect stderr along with stdout to the pipe:
./prog 2>&1 | grep '\[\[\[\['
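The effect of the redirect can be reproduced with a stand-in for ./prog. In this sketch the emit function is hypothetical and simply writes one matching line to each stream. It also uses grep -F, which treats the pattern as a fixed string, so the brackets need no escaping.

```shell
# Stand-in for ./prog: one match on stdout, one on stderr.
emit() {
  echo '[[[[ on stdout'
  echo '[[[[ on stderr' >&2
  echo 'no brackets here'
}

# Without the redirect the pipe only carries stdout.
stdout_only=$(emit 2>/dev/null | grep -c -F '[[[[')

# With 2>&1 both streams go through the pipe.
both=$(emit 2>&1 | grep -c -F '[[[[')

echo "stdout only: $stdout_only, both streams: $both"
```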