What tools deal with spaces in columnar data well? - parsing

Let's start with an example that I ran into recently:
C:\>net user
User accounts for \\SOMESYSTEM
-------------------------------------------------------------------------------
ASPNET user1 AnotherUser123
Guest IUSR_SOMESYSTEM IWAM_SOMESYSTEM
SUPPORT_12345678 test userrrrrrrrrrrr test_userrrrrrrrrrrr
The command completed successfully.
In the third row, second column there is a login with a space. This causes many of the tools that separate fields based on white space to treat this field as two fields.
How would you deal with data formatted this way using today's tools?
Here is an example in pure** Windows batch language on the command prompt that I would like to have replicated in other modern cross-platform text processing tool sets:
C:\>cmd /v:on
Microsoft Windows [Version 5.2.3790]
(C) Copyright 1985-2003 Microsoft Corp.
C:\>echo off
for /f "skip=4 tokens=*" %g in ('net user ^| findstr /v /c:"The command completed successfully."') do (
More? set record=%g
More? echo !record:~0,20!
More? echo !record:~25,20!
More? echo !record:~50,20!
More? )
ASPNET
user1
AnotherUser123
Guest
IUSR_SOMESYSTEM
IWAM_SOMESYSTEM
SUPPORT_12345678
test userrrrrrrrrrrr
test_userrrrrrrrrrrr
echo on
C:\>
** Using variable delayed expansion (cmd /v:on or setlocal enabledelayedexpansion in a batch file), the for /f command output parser, and variable substring syntax... none of which are well documented except for at the wonderful website http://ss64.com/nt/syntax.html
Looking into AWK, I didn't see a way to deal with the 'test userrrrrrrrrrrr' login field without using substr() in a similar method to the variable substring syntax above. Is there another language that makes text wrangling easy and is not write-only like sed?

PowerShell:
Native user list example, no text matching needed
Get-WmiObject Win32_UserAccount | Format-Table -Property Caption -HideTableHeaders
Or, if you want to use "NET USER":
$out = net user # Send stdout to $out
$out = $out[4..($out.Length-3)] # Skip header/tail
[regex]::split($out, "\s{2}") | where { $_.Length -ne 0 }
# Split on double-space and skip empty lines

Just do a direct query for user accounts, using vbscript (or powershell if your system supports)
strComputer = "."
Set objWMIService = GetObject("winmgmts:\\" & strComputer & "\root\cimv2")
Set colItems = objWMIService.ExecQuery("Select * from Win32_UserAccount",,48)
For Each objItem in colItems
Wscript.Echo objItem.Name
Next
This will show you a list of users, one per line. If your objective is just to show user names, there is no need to use other tools to process the data.

Awk isn't so great for that problem because awk is focused on lines as records with a recognizable field separator, while the example file uses fixed-width fields. You could, e.g., try to use a regular expression for the field separator, but that can go wrong. The right way would be to use that fixed width to clean the file up into something easier to work with; awk can do this, but it is inelegant.
Essentially, the example is difficult because it doesn't follow any clear rules. The best approach is a quite general one: write data to files in a well-defined format with a library function, read files by using a complementary library function. Specific language doesn't matter so much with this strategy. Not that that helps when you already have a file like the example.
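For a file that already looks like the example, the fixed-width idea can be sketched in Python, which runs on the same platforms as the other tools mentioned here. This is only a minimal sketch: the column width of 25 and the three columns are assumptions taken from the offsets 0, 25 and 50 in the question's batch script, and split_fixed_width and parse_net_user are made-up helper names, not part of any library.

```python
def split_fixed_width(line, width=25, columns=3):
    """Slice one line into fixed-width fields and drop the padding."""
    fields = (line[i * width:(i + 1) * width].strip() for i in range(columns))
    return [field for field in fields if field]


def parse_net_user(text, width=25, columns=3):
    """Return the login names found in 'net user' style output."""
    names = []
    for line in text.splitlines():
        # Skip blank lines, the header, the dashed rule and the footer.
        if (not line.strip()
                or line.startswith("User accounts for")
                or line.startswith("---")
                or line.startswith("The command completed")):
            continue
        names.extend(split_fixed_width(line, width, columns))
    return names


if __name__ == "__main__":
    # Rebuild the question's three rows with 25-character columns (assumed).
    rows = [
        ("ASPNET", "user1", "AnotherUser123"),
        ("Guest", "IUSR_SOMESYSTEM", "IWAM_SOMESYSTEM"),
        ("SUPPORT_12345678", "test userrrrrrrrrrrr", "test_userrrrrrrrrrrr"),
    ]
    sample = "\n".join("".join(name.ljust(25) for name in row) for row in rows)
    for name in parse_net_user(sample):
        print(name)
```

Because the fields are cut by position rather than by whitespace, a login containing a space stays in one piece. If the real column width differs, only the width argument needs to change.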

TEST
printf "
User accounts for \\SOMESYSTEM
-------------------------------------------------------------------------------
ASPNET user1 AnotherUser123
Guest IUSR_SOMESYSTEM IWAM_SOMESYSTEM
SUPPORT_12345678 test userrrrrrrrrrrr test_userrrrrrrrrrrr
The command completed successfully.
\n" | awk 'BEGIN{
colWidth=25
}
/-----/ {next}
/^[[:space:]]*$/{next}
/^User accounts/{next}
/^The command completed/{next}
{
col1=substr($0,1,colWidth)
col2=substr($0,1+colWidth,colWidth)
col3=substr($0,1+(colWidth*2),colWidth)
printf("%s\n%s\n%s\n", col1, col2, col3)
}'
There's probably a better way than the 1+(colWidth*2) but I'm out of time for right now.
If you try to execute code as is, you'll have to remove the leading spaces at the front of each line in the printf statement.
I hope this helps.

For this part:
set record=%g
More? echo !record:~0,20!
More? echo !record:~25,20!
More? echo !record:~50,20!
I would use:
for /f "tokens=1-26 delims= " %a in ("%g") do (
if not "%a"=="" echo %a
if not "%b"=="" echo %b
if not "%c"=="" echo %c
rem ... and so on...
if not "%y"=="" echo %y
if not "%z"=="" echo %z
)
That is if I had to do this using batch. But I wouldn't dare to call this "modern" as per your question.

Perl is really the best choice for your case, and millions of others. It is very common and the web is rife with examples and documentation. Yes, it is cross-platform, extremely stable, and nearly perfectly consistent across platforms. I say nearly because nothing is perfect, but I doubt that you would encounter an inconsistency in your lifetime.
It is a language interpreter but supports a rich command-line interface as well.

Related

duplicate grep output when comparing two files

I have literally been at this for 5 hours, I have busybox on my device, and I unfortunately do not have -X in grep to make my life easier.
edit;
I have two lists, both containing MAC addresses. Essentially I just want offline MAC address lookup so I don't have to keep looking vendors up online.
list.txt has the vendor MAC prefixes. This isn't the complete list, just an example:
00:13:46
00:15:E9
00:17:9A
00:19:5B
00:1B:11
00:1C:F0
scan holds a list of full-length MAC addresses whose vendors are unknown. Whenever there is a match, I want the line in scan to be output.
It more or less does that, but it prints everything from the scan file first and then prints the matching line at the end, which causes duplicates. I tried sort -u, but it has no effect. It is as if there were two separate outputs from two different methods: the whole scan file is printed instantly, and a couple of seconds later the matching line is printed.
From searching I came across this
#!/bin/bash
while read line; do
grep -F 'list' 'scan'
done < list.txt
which displays the duplicate result when/if found. The output is pretty much my scan file echoed back, followed by the matched pattern, which creates the duplicate.
It is frustrating that I have not found a solution after clicking on all the links in Google up to page 9.
Please someone help me.
I don't know if the Busybox sed supports this out of the box, but it should be easy to do in Awk or Perl instead then.
Create a sed script to print lines from file2 which are covered by a prefix in file1 by transforming each line in file1 into a sed command to print a match for that regular expression:
sed 's%.*%/^&/p%' file1 | sed -n -f - file2
The same in Awk:
awk 'NR==FNR { a[++i]="^" $0; next }
{ for (j=1; j<=i; ++j) if ($0 ~ a[j]) print }' file1 file2
OK guys, I did a nested for loop (probably very inefficient), but I got it working and printing the matching MAC addresses using this:
#!/usr/bin/bash
for scanlist in `cat scan | cut -d: -f1,2,3`
do
for listt in `cat list`
do
if [[ $scanlist == $listt ]]; then
grep $scanlist scan
fi
done
done
if anyone can make this more elegant but it works for me for now. I think the problem I had was one list contained just 00:11:22 while my other list contained 00:11:22:33:44:55 that is why I cut it on my scanlist to make same length as my other list. So this only output the matches instead of doing duplicate output.
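The same prefix lookup can be sketched without nested loops by putting the vendor prefixes into a set. This is a minimal Python sketch under two assumptions: each line of scan starts with the full MAC address, and every prefix is exactly three octets (8 characters, as in list.txt above). The names load_prefixes and matching_macs are invented for the example.

```python
def load_prefixes(lines):
    """Normalise vendor prefixes such as '00:13:46' into an upper-case set."""
    return {line.strip().upper() for line in lines if line.strip()}


def matching_macs(scan_lines, prefixes, prefix_len=8):
    """Yield each scanned line whose first three octets are a known prefix."""
    for line in scan_lines:
        mac = line.strip()
        # Compare only the leading 'XX:XX:XX' part, ignoring letter case.
        if mac[:prefix_len].upper() in prefixes:
            yield mac


if __name__ == "__main__":
    prefixes = load_prefixes(["00:13:46", "00:15:E9", "00:17:9A"])
    # Made-up scan results, for illustration only.
    scan = ["00:15:e9:aa:bb:cc", "12:34:56:78:9a:bc", "00:13:46:01:02:03"]
    for mac in matching_macs(scan, prefixes):
        print(mac)
```

Each scanned line is printed at most once, so there is nothing to de-duplicate afterwards. With real files, open("list.txt") and open("scan") can be passed in place of the lists.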

Using tr to change many files names

I am basically trying to replace all special characters in directory names and file names with a period. I am attempting to use tr, but I am very new and I do not want to mess up all of my music and picture naming. I am making the switch from Windows to Linux and trying to get everything into a nicely formatted pattern. I have used tr semi-successfully, but I would like some pro help from you guys! Thanks in advance! I have looked at the tr man pages, but I am worried about really messing up 12 years of picture and music file names! The two main character sequences I am trying to replace are "-" and " - ", but the naming scheme I've used in Windows was done by hand over the years and it varies. So I was hoping to go through everything and replace mainly all cases of "-" or " - ", but catching any fat-fingering I have done over the years that doesn't fit that pattern would be great too. I am thinking something like:
tr -cd [:alnum:] '.'
would this work?
My main goal is to turn something like
01 - Name Of Song (or any variation of special/punctuation characters)
into
01.Name.Of.Song
You don't want to use the d option since it just deletes the matched characters. And you may want to use the s (squeeze) option. Try this:
tr -cs '[:alnum:]' '.'
You can test it like this:
echo '01 - Name Of Song' | tr -cs '[:alnum:]' '.'
(Ignore the extra period at the end of the output. That's just the newline character and won't appear in filenames ... generally.)
But this is probably not of much use in your task. If you want to do a mass rename, you might use a little perl program like this:
#!/usr/bin/perl
$START_DIRECTORY = "Music";
$RENAME_DIRECTORIES = 1; # boolean (0 or 1)
sub procdir {
    chdir $_[0] or return;
    my @files = <*>;
    for my $file (@files) {
        procdir($file) if (-d $file);
        # leave directory names alone unless renaming them is enabled
        next if (-d $file && !$RENAME_DIRECTORIES);
        my $oldname = $file;
        if ($file =~ s/[^[:alnum:].]+/\./g) {
            print "$oldname => $file\n";
            # rename $oldname, $file; # may not rename directories(?)
        }
    }
    chdir "..";
}
procdir($START_DIRECTORY);
Run it with the rename command commented out (as above) to test it. Uncomment the rename command to actually rename the files. Caveat emptor. There be dragons. Etc.
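For comparison, here is a dry-run sketch of the same idea in Python. It only prints what would be renamed and never touches a file. The assumptions are mine, not the asker's: names are ASCII, a trailing ".xxx" of up to five letters or digits is treated as the extension and kept, and only files (not directories) are listed. dotted_name and preview_renames are hypothetical helper names.

```python
import os
import re

# Optional short extension such as ".mp3" at the very end of the name.
_NAME = re.compile(r"(.*?)(\.[A-Za-z0-9]{1,5})?", re.DOTALL)


def dotted_name(name):
    """Collapse every run of non-alphanumeric characters into one period."""
    match = _NAME.fullmatch(name)
    stem, ext = match.group(1), match.group(2) or ""
    cleaned = re.sub(r"[^A-Za-z0-9]+", ".", stem).strip(".")
    if not cleaned:
        return name  # nothing usable is left, so keep the name as it is
    return cleaned + ext


def preview_renames(root):
    """Yield (old_path, new_path) pairs for files below root; renames nothing."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            new = dotted_name(name)
            if new != name:
                yield os.path.join(dirpath, name), os.path.join(dirpath, new)


if __name__ == "__main__":
    for old, new in preview_renames("Music"):
        print(old, "=>", new)
```

Two different old names can collapse to the same new name, so it is worth checking the preview for duplicates before doing any real renaming.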

How to make the output of Maxima cleaner?

I want to make use of Maxima as the backend to solve some computations used in my LaTeX input file.
I did the following steps.
Step 1
Download and install Maxima.
Step 2
Create a batch file named cas.bat (for example) as follows.
rem cas.bat
echo off
set PATH=%PATH%;"C:\Program Files (x86)\Maxima-5.31.2\bin"
maxima --very-quiet -r %1 > solution.tex
Save the batch in the same directory in which your input file below exists. It is just for the sake of simplicity.
Step 3
Create the input file named main.tex (for example) as follows.
% main.tex
\documentclass[preview,border=12pt,12pt]{standalone}
\usepackage{amsmath}
\def\f(#1){(#1)^2-5*(#1)+6}
\begin{document}
\section{Problem}
Evaluate $\f(x)$ for $x=\frac 1 2$.
\section{Solution}
\immediate\write18{cas "x: 1/2;tex(\f(x));"}
\input{solution}
\end{document}
Step 4
Compile the input file with pdflatex -shell-escape main and you will get a nice output as follows.
Step 5
Done.
Questions
Apparently the output of Maxima is as follows. I don't know how to make it cleaner.
solution.tex
1
-
2
$${{15}\over{4}}$$
false
Now, my questions are
how to remove such texts?
how to obtain just \frac{15}{4} without $$...$$?
(1) To suppress output, terminate input expressions with dollar sign (i.e. $) instead of semicolon (i.e. ;).
(2) To get just the TeX-ified expression sans the environment delimiters (i.e. $$), call tex1 instead of tex. Note that tex1 returns a string, which you have to print yourself (while tex prints it for you).
Combining these ideas with the stuff you showed, I think your program could look like this:
"x: 1/2$ print(tex1(\f(x)))$"
I think you might find the Maxima mailing list helpful. I'm pretty sure there have been several attempts to create a system such as the one you describe. You can also look at the documentation.
I couldn't find any way to completely clean up Maxima's output within Maxima itself. It always echoes the input line, and always writes some whitespace after the output. The following is an example of a perl script that accomplishes the cleanup.
#!/usr/bin/perl
use strict;
my $var = $ARGV[0];
my $expr = $ARGV[1];
sub do_maxima_to_tex {
my $m = shift;
my $c = "maxima --batch-string='exptdispflag:false; print(tex1($m))\$'";
my $e = `$c`;
my @x = split(/\(%i\d+\)/,$e); # output contains stuff like (%i1)
my $f = pop @x; # remove everything before the echo of the last input
while ($f=~/\A /) {$f=~s/\A .*\n//} # remove echo of input, which may be more than one line
$f =~ s/\\\n//g; # maxima breaks latex tokens in the middle at end of line; fix this
$f =~ s/\n/ /g; # if multiple lines, get it into one line
$f =~ s/\s+\Z//; # get rid of final whitespace
return $f;
}
my $e1 = do_maxima_to_tex("diff($expr,$var,1)");
my $e2 = do_maxima_to_tex("diff($expr,$var,2)");
print <<TEX;
The first derivative is \$$e1\$. Differentiating a second time,
we get \$$e2\$.
TEX
If you name this script a.pl, then doing
a.pl z 3*z^4
outputs this:
The first derivative is $12\,z^3$. Differentiating a second time,
we get $36\,z^2$.
For the OP's application, a script like this one could be what is invoked by the write18 in the latex file.
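If changing the Maxima input is not an option, the captured output can also be filtered afterwards. The following is a minimal Python sketch, assuming the output looks like the solution.tex shown in the question, with the result wrapped in $$...$$. extract_tex is a made-up helper name, and nothing here calls Maxima itself.

```python
import re

# Display-math blocks as printed by Maxima's tex(): $$ ... $$
_DISPLAY_MATH = re.compile(r"\$\$(.*?)\$\$", re.DOTALL)


def extract_tex(maxima_output):
    """Return the TeX found between $$ delimiters, without the delimiters."""
    return [body.strip() for body in _DISPLAY_MATH.findall(maxima_output)]


if __name__ == "__main__":
    # The solution.tex content quoted in the question.
    sample = "1\n-\n2\n$${{15}\\over{4}}$$\nfalse\n"
    for expression in extract_tex(sample):
        print(expression)
```

Everything outside the $$ pairs, such as the echoed 1/2 and the trailing false, is simply ignored. Note that this keeps Maxima's {{15}\over{4}} form and does not convert it to \frac.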
If you really want to use LaTeX then the maxiplot package is the answer. It provides a maxima environment inside of which you enter Maxima commands. When you process your LaTeX file a Maxima batch file is generated. Process this file with Maxima and process your LaTeX file again to typeset the equations generated by Maxima.
If you would rather have 2D math input with live typesetting then use TeXmacs. It is a cross-platform document authoring environment (a word processor on steroids if you like) that includes plugins for Maxima, Mathematica and many more scientific computing tools. If you need to or are not satisfied with the typesetting, you can export your document to LaTeX.
I know this is a very old post. Excellent answers for the question asked by OP. I was using --very-quiet -r options on the command line for a long time like OP, but in maxima version 5.43.2 they behave differently. See maxima command line v5.43 is behaving differently than v5.41. I am answering this question with a cross reference because when incorporating these answers in your solutions, make sure the changes in behavior of those command line flags are also incorporated.

Search for combinations of a phrase

What is the way to use 'grep' to search for combinations of a pattern in a text file?
Say, for instance I am looking for "by the way" and possible other combinations like "way by the" and "the way by"
Thanks.
Awk is the tool for this, not grep. On one line:
awk '/by/ && /the/ && /way/' file
Across the whole file:
gawk -v RS='\0' '/by/ && /the/ && /way/' file
Note that this is searching for the 3 words, not searching for combinations of those 3 words with spaces between them. Is that what you want?
Provide more details including sample input and expected output if you want more help.
The simplest approach is probably by using regexps. But this is also slightly wrong:
egrep '([ ]*(by|the|way)\>){3}'
What this does is to match on the group of your three words, taking spaces in front of the words
with it (if any) and forcing it to be a complete word (hence the \> at the end) and matching the string if any of the words in the group occurs three times.
Example of running it:
$ echo -e "the the the\nby the\nby the way\nby the may\nthe way by\nby the thermo\nbypass the thermo" | egrep '([ ]*(by|the|way)\>){3}'
the the the
by the way
the way by
As already said, this produces a 'false' positive for the the the, but if you can live with that, I'd recommend doing it this way.
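If the false positive matters, one option is to spell out every ordering of the words explicitly. Here is a minimal Python sketch of that idea, assuming the words are separated by whitespace and each word is used exactly once. any_order_pattern is a hypothetical helper name. The number of alternatives grows factorially, so this is only sensible for short phrases.

```python
import re
from itertools import permutations


def any_order_pattern(phrase):
    """Build a regex matching the words of phrase in any order, once each."""
    words = phrase.split()
    alternatives = (
        r"\s+".join(re.escape(word) for word in order)
        for order in permutations(words)
    )
    # \b keeps 'by' from matching inside 'bypass' and 'the' inside 'thermo'.
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


if __name__ == "__main__":
    pattern = any_order_pattern("by the way")
    lines = ["the the the", "by the way", "the way by", "bypass the thermo"]
    for line in lines:
        if pattern.search(line):
            print(line)
```

For "by the way" this builds six alternatives (3 factorial), one for each ordering.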

Grep for beginning and end of line?

I have a file where I want to grep for lines that start with either -rwx or drwx AND end in any number.
I've got this, but it isn't quite right. Any ideas?
grep [^.rwx]*[0-9] usrLog.txt
The tricky part is a regex that includes a dash as one of the valid characters in a character class. The dash has to come immediately after the start for a (normal) character class and immediately after the caret for a negated character class. If you need a close square bracket too, then you need the close square bracket followed by the dash. Mercifully, you only need dash, hence the notation chosen.
grep '^[-d]rwx.*[0-9]$' "$@"
See: Regular Expressions and grep for POSIX-standard details.
It looks like you were on the right track... The ^ character matches beginning-of-line, and $ matches end-of-line. Jonathan's pattern will work for you... just wanted to give you the explanation behind it
It should be noted that not only will the caret (^) behave differently within the brackets, it will have the opposite result of placing it outside of the brackets. Placing the caret where you have it will search for all strings NOT beginning with the content you placed within the brackets. You also would want to place a period before the asterisk in between your brackets as with grep, it also acts as a "wildcard".
grep ^[.rwx].*[0-9]$
This should work for you, I noticed that some posters used a character class in their expressions which is an effective method as well, but you were not using any in your original expression so I am trying to get one as close to yours as possible explaining every minor change along the way so that it is better understood. How can we learn otherwise?
You probably want egrep. Try:
egrep '^[d-]rwx.*[0-9]$' usrLog.txt
are you parsing output of ls -l?
If you are, and you just want to get the file name
find . -iname "*[0-9]"
If you have no choice because usrLog.txt is created by something/someone else and you absolutely must use this file, other options include
awk '/^[-d].*[0-9]$/' file
Ruby(1.9+)
ruby -ne 'print if /^[-d].*[0-9]$/' file
Bash
while read -r line ; do case $line in [-d]*[0-9] ) echo $line; esac; done < file
Many answers provided for this question. Just wanted to add one more which uses bashism-
#! /bin/bash
while read -r || [[ -n "$REPLY" ]]; do
[[ "$REPLY" =~ ^(-rwx|drwx).*[[:digit:]]+$ ]] && echo "Got one -> $REPLY"
done <"$1"
@kurumi's answer for bash, which uses case, is also correct, but it will not read the last line of the file if there is no newline at the end (just save the file without pressing Enter/Return after the last line).
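For completeness, the same filter as a minimal Python sketch, using the anchored pattern from the grep answers above. Iterating over a file object in Python also returns a final line that has no trailing newline, so that case needs no special handling. matching_lines is a made-up helper name, and the file name is simply the one from the question.

```python
import re

# Starts with -rwx or drwx and ends in a digit.
PERM_LINE = re.compile(r"^[-d]rwx.*[0-9]$")


def matching_lines(lines):
    """Yield lines that start with -rwx or drwx and end in a digit."""
    for line in lines:
        line = line.rstrip("\r\n")
        if PERM_LINE.search(line):
            yield line


if __name__ == "__main__":
    with open("usrLog.txt") as handle:
        for line in matching_lines(handle):
            print(line)
```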
