How to parse text file and retrieve only selected text?

In my log file I have the text in the following format:
18 Mar 2001 14:18:17,438 INFO DomainName1\EmpId1#Admin#3.1
18 Mar 2001 14:19:00,872 INFO DomainName2\EmpId2#User#1.3.2.0
18 Mar 2001 14:20:05,418 INFO DomainName3\EmpId3#Admin#4.3.1.0
I just want to get the EmpIds.

What about something like
cat logfile | cut -d '#' -f 1 | cut -d '\' -f 2
(This assumes that you are on a Unix-like system, and also assumes that '#' and '\' won't pop up elsewhere than where you put them in your example.)
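If GNU grep is available, the same extraction can be done in one step; a small sketch, assuming (as above) that '\' and '#' only appear as delimiters:
grep -oP '\\\K[^#]+' logfile
which should print EmpId1, EmpId2 and EmpId3, one per line.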

Related

How do I grep starting at a specific point in a log file?

I have daily syslog files which contain syslog messages in the format: MMM DD HH:MM:SS additional_data_here
We make a change, then want to see if the syslog messages continue.
For example, say a change was made at 09:55. I want to ignore everything prior to the first line that contains Oct 29 09:55:00. Then I want to grep for my error message after that first line match.
For this example, I have to create several different statements, like this:
grep -e "Oct 29 09:5[5-9]" syslog20211029 | grep "[my message]"
grep -e "Oct 29 1[0-1]:" syslog20211029 | grep "[my message]"
But I do this often enough that I'd like to find a better, more consistent way. Something like:
start-at-first-match "Oct 29 09:55:00" syslog20211029 | grep "[my message]"
But I don't know what the start-at-first-match option is. Any suggestions?
If you want to restrict yourself to grep, there is no exact option for this, but -A num can still meet your need if you give a big number for num:
grep -A 10000000 "Oct 29 09:55:00" syslog20211029
This will print the matching line and the next 10 million lines.
If you want everything that follows the matching line for sure (without having to give an "unreachable" number of lines), you have to use another command such as sed or awk. With sed: sed -n '/Oct 29 09:55:00/,$ p' syslog20211029 (the -n stops sed from printing lines by default, and the address range from the /pattern/ line to the end of file $ tells it which lines to print).
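An awk equivalent of that start-at-first-match idea, if you prefer it (same file name as in the question): it sets a flag on the first matching line and prints everything from there on.
awk '/Oct 29 09:55:00/{found=1} found' syslog20211029 | grep "[my message]"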

How to count the occurrence of a string in a file, for all files in a directory and output into a new file with shell

I have hundreds of files in a directory that I would like to count the occurrence of a string in each file.
I would like the output to be a summary file that contains the original file name plus the count (ideally on the same line)
for example
file1 6
file2 3
file3 4
etc
Thanks for your consideration
CAUTION: I am pretty much an enthusiastic amateur, so take everything with a grain of salt.
Several questions for you - depending on your answers, the solution below may need some adjustments.
Are all your files in the same directory, or do you also need to look through subdirectories and sub-subdirectories, etc.? Below I make the simplest assumption - that all your files are in a single directory.
Are all your files text files? In the example below, the directory will contain text files, executable files, symbolic links, and directories; the count will only be given for text files. (Whatever Linux believes to be text files, anyway.)
There may be files that do not contain the searched-for string at all. Those are not included in the output below. Do you need to show them too, with a count of 0?
I assume by "count occurrences" you mean all of them - even if the string appears more than once on the same line. (Which is why a simple grep -c won't cut it, as that only counts lines that contain the substring, no matter how many times each.)
Do you need to include hidden files (whose name begins with a period)? In my code below I assumed you do.
Do you care that the count appears first, and then the file name?
OK, so here goes.
[oracle@localhost test]$ ls -al
total 20
drwxr-xr-x. 3 oracle oinstall 81 Apr 3 18:42 .
drwx------. 39 oracle oinstall 4096 Apr 3 18:42 ..
-rw-r--r--. 1 oracle oinstall 40 Apr 3 17:44 aa
lrwxrwxrwx. 1 oracle oinstall 2 Apr 3 18:04 bb -> aa
drwxr-xr-x. 2 oracle oinstall 6 Apr 3 17:40 d1
-rw-r--r--. 1 oracle oinstall 38 Apr 3 17:56 f1
-rw-r--r--. 1 oracle oinstall 0 Apr 3 17:56 f2
-rwxr-xr-x. 1 oracle oinstall 123 Apr 3 18:15 zfgrep
-rw-r--r--. 1 oracle oinstall 15 Apr 3 18:42 .zz
Here's the command to count 'waca' in the text files in this directory (not recursive). I define a variable substr to hold the desired string. (Note that it could also be a regular expression, more generally - but I didn't test that so you will have to, if that's your use case.)
[oracle@localhost test]$ substr=waca
[oracle@localhost test]$ find . -maxdepth 1 -type f \
> -exec grep -osHI "$substr" {} \; | sed "s/^\.\/\(.*\):$substr$/\1/" | uniq -c
8 aa
2 f1
1 .zz
Explanation: I use find to find just the files in the current directory (excluding directories, links, and whatever other trash I may have in the directory). This will include the hidden files, and it will include binary files, not just text. In this example I search in the current directory, but you can use any path instead of the dot. I limit the depth to 1, so the command only applies to files in the current directory - the search is not recursive. Then I pass the results to grep. -o means find all matches (even if there are multiple matches per line of text) and show each match on a separate line. -s is for silent mode (just in case grep thinks of printing messages), -H is to include file names (even when there is only one file matching the substring), and -I is to ignore binary files.
Then I pass this to sed so that from each row output by grep I keep just the file name, without the leading ./ and without the trailing :waca. This step may not be necessary - if you don't mind the output like this:
8 ./aa:waca
2 ./f1:waca
1 ./.zz:waca
Then I pass the output to uniq -c to get the counts.
You can then redirect the output to a file, if that's what you need. (Left as a trivial exercise - since I forgot that was part of the requirement, sorry.)
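For the record, one way to do that, also putting the file name first as in your example (counts.txt is just a placeholder output name):
find . -maxdepth 1 -type f -exec grep -osHI "$substr" {} \; | sed "s/^\.\/\(.*\):$substr$/\1/" | uniq -c | awk '{print $2, $1}' > counts.txt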
Thanks for the detailed answer; it provides me with ideas for future projects.
In my case the files were all the same format (output from another script) and the only files in the directory.
I found the answer in another thread
grep -c -R 'xxx'

Grep content out of txt file with regex

I have a list in a txt file which looks like this.
10.9.0.18,tom,34.0.1.2:44395,Thu Apr 18 07:14:20 2019
10.9.0.10,jonas,84.32.45.2:44016,Thu Apr 18 07:16:06 2019
10.9.0.6,philip,23.56.222.3:55202,Thu Apr 18 07:16:06 2019
10.9.0.26,coolguy,12.34.56.7:53316,Thu Apr 18 07:16:06 2019
I would like to have a script which provides me with the following output:
tom jonas philip coolguy
I've been looking into something like this:
grep -oP "^10.9.0.*,wq$1-\K.*" | cut -d, -f1 | sort
But I am not quite getting there, getting no output at all.
Extract second field
Replace newlines with spaces
cat <<EOF |
10.9.0.18,tom,34.0.1.2:44395,Thu Apr 18 07:14:20 2019
10.9.0.10,jonas,84.32.45.2:44016,Thu Apr 18 07:16:06 2019
10.9.0.6,philip,23.56.222.3:55202,Thu Apr 18 07:16:06 2019
10.9.0.26,coolguy,12.34.56.7:53316,Thu Apr 18 07:16:06 2019
EOF
cut -d, -f2 | tr '\n' ' '
You're not getting output because your grep pattern never matches anything (and you don't need Perl regex for this).
You'll need to select the second field too:
grep '^10\.9\.0\.' data.txt | cut -d, -f2
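With the sample data in data.txt that prints one name per line; piping through tr as in the answer above puts them on a single line:
grep '^10\.9\.0\.' data.txt | cut -d, -f2 | tr '\n' ' '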
If awk is an option you could try:
awk -F, '{printf "%s ", $2} END {print ""}' file.txt
The printf "%s ", $2 prints the second field followed by a space instead of the default newline.
The END {print ""} adds a final newline after the last record.
This is the right answer if you want multi-line output:
$ awk -F, '/^10\.9\.0/{print $2}' file
tom
jonas
philip
coolguy
or this for single:
$ awk -F, '/^10\.9\.0/{o=o s $2; s=OFS} END{print o}' file
tom jonas philip coolguy
You need to escape the dots, as an unescaped . matches any character in a regexp, and you don't need to add a .* at the end of the regexp, as that just matches "something or nothing" and adds no constraint.
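To see why the escaping matters, a quick illustration (the 10x9y0z18 address is made up):
echo '10x9y0z18,eve,1.2.3.4:1,Thu Apr 18 07:20:00 2019' | grep -c '^10.9.0'    # unescaped dots: prints 1
echo '10x9y0z18,eve,1.2.3.4:1,Thu Apr 18 07:20:00 2019' | grep -c '^10\.9\.0'  # escaped dots: prints 0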

How do I grab a specific section of a stdout?

I am trying to grab the sda# of a drive that was just inserted.
tail -f /var/log/messages | grep sda:
Returns: Mar 12 17:21:55 raspberrypi kernel: [ 1133.736632] sda: sda1
I would like to grab the sda1 part of the stdout; how would I do that?
I suggest using this with GNU grep:
| grep -Po 'sd[a-z]+: \Ksd[a-z0-9]+$'
\K: This sequence resets the starting point of the reported match. Any previously matched characters are not included in the final matched sequence.
See: The Stack Overflow Regular Expressions FAQ
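Put together with the tail -f from the question, that would look something like this (a sketch assuming GNU grep; --line-buffered makes matches appear as soon as tail emits new lines):
tail -f /var/log/messages | grep --line-buffered -Po 'sd[a-z]+: \Ksd[a-z0-9]+$'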

Grep from text file over the past hour

I have several commands similar to:
ping -i 60 8.8.8.8 | while read pong; do echo "$(date): $pong" >> /security/latencytracking/pingcapturetest2.txt; done
output:
Tue Feb 4 15:13:39 EST 2014: 64 bytes from 8.8.8.8: icmp_seq=0 ttl=50
time=88.844 ms
I then search the results using:
cat /security/latencytracking/pingcapturetest* | egrep 'time=........ ms|time=......... ms'
I am looking for latency anomalies over X ms.
Is there a way to search better than I am doing and search over the past 1,2,3, etc. hours as opposed to from the start of the file? This could get tedious over time.
You could add a Unix timestamp to your log, and then search based on that:
ping -i 60 8.8.8.8 | while read pong; do
echo "$(date +"%s"): $pong" >> log.txt
done
Your log will have entries like:
1391548048: 64 bytes from 8.8.8.8: icmp_req=1 ttl=47 time=20.0 ms
Then search with a combination of date and awk:
Using GNU Date (Linux etc):
awk -F: "\$1 > $(date -d '1 hour ago' +'%s')" log.txt
or BSD Date (Mac OSX, BSD)
awk -F: "\$1 > $(date -j -v '-1H' +%s)" log.txt
The command uses date -d (or date -v for the same task on BSD/OSX) to translate an English time phrase into a Unix timestamp. awk then compares the logged timestamp (the first field before the :) with the generated timestamp and prints all log lines which have a higher value, i.e. newer ones.
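Combined with the egrep from your question, filtering the last hour for long latencies then becomes (GNU date, same log.txt as above):
awk -F: "\$1 > $(date -d '1 hour ago' +'%s')" log.txt | egrep 'time=........ ms|time=......... ms'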
If you are familiar with R:
1. I'd slurp the whole thing in with read.table(), drop the unnecessary columns
2. then do whatever calculations you like
If you have tens of millions of records, though, R might be a bit slow.
Plan B:
1. use cut to nuke anything you don't need and then go to the plan above.
You can also do it with bash. You can compare dates, as follows:
Crop out the date field. You can convert that date into the number of seconds since midnight of 1st Jan 1970:
date -d "Tue Feb 4 15:13:39 EST 2014" '+%s'
Then you compare that number against the number of seconds for one hour ago:
reference=$(date --date='-1 hour' '+%s')
This way you get all records from the last hour. Then you can filter on the length of the delay.
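A rough sketch of that idea, assuming GNU date and the exact log format from the question (it forks date once per line, so it is slower than the timestamp-in-the-log approach above):
reference=$(date --date='-1 hour' '+%s')
while read -r dow mon day time tz year rest; do
    # rebuild the logged date (dropping the trailing colon after the year) and convert it to epoch seconds
    stamp=$(date -d "$dow $mon $day $time $tz ${year%:}" '+%s' 2>/dev/null) || continue
    # keep only lines newer than the one-hour-ago reference
    [ "$stamp" -ge "$reference" ] && printf '%s\n' "$dow $mon $day $time $tz $year $rest"
done < /security/latencytracking/pingcapturetest2.txt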
