Grep regex command

I currently have:
grep -G <a href="machines.jsp\?type=active>.<\/a>
which should match this string:
<a href="machines.jsp\?type=active>4<\/a>
but I am not getting any output. Could someone tell me what I am doing wrong?
Edit:
Well, silly mistake: I missed the closing " after active to complete the href.
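For reference, the corrected command looks like this (the file name is illustrative):
grep -G '<a href="machines.jsp?type=active">.</a>' machines.html
Single-quoting the pattern keeps the shell from touching the double quote and the angle brackets; in a basic regular expression the unescaped ? is a literal question mark and the . matches the single character between the tags.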

Related

Is there a script that can extract particular links from a txt file and write them to another txt file?

I'm looking for a script (or if there isn't one, I guess I'll have to write my own).
I wanted to ask if anyone here knows of a script that can take a txt file with n links (let's say 200). I need to extract only the links that have particular characters in them; let's say I only need links that contain "/r/learnprogramming". I need the script to get those links and write them to another txt file.
Edit: Here is what helped me: grep -i "/r/learnprogramming" 1.txt >2.txt
You can use Ajax to read the .txt file using jQuery:
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script>
jQuery(function($) {
  console.log("start");
  // Fetch the whole text file, pick a random line, and render it in #ajax.
  $.get("https://ayulayol.imfast.io/ajaxads/ajaxads.txt", function(wholeTextFile) {
    var lines = wholeTextFile.split(/\n/),
        randomIndex = Math.floor(Math.random() * lines.length),
        randomLine = lines[randomIndex];
    console.log(randomIndex, randomLine);
    $("#ajax").html(randomLine.replace(/#/g, "<br>"));
  });
});
</script>
<div id="ajax"></div>
If you are using Linux or macOS you can use cat and grep to output the links:
cat in.txt | grep "/r/learnprogramming" > out.txt
Solution provided by OP:
grep -i "/r/learnprogramming" 1.txt >2.txt
Since you did not provide the exact format of the document, I assume the links are separated by newline characters. In that case the code is pretty straightforward in Python (or awk), since you can iterate over file.readlines() and print only the lines that match your pattern (either with a simple pattern in line containment check, or with a regex if the pattern is more complex). To store the links in a new file, simply redirect stdout to a new file like this:
python script.py > links.txt
The approach above works even if the links are separated by an arbitrary symbol s: first read the file into a single string, then split it on s. I hope this helps.
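A minimal sketch of that approach (the script name and the 1.txt/2.txt file names follow the OP's example):
# script.py -- print only the lines that contain the wanted substring
pattern = "/r/learnprogramming"

with open("1.txt") as infile:
    for line in infile:
        if pattern in line:      # swap in re.search() for more complex patterns
            print(line, end="")  # each line already carries its trailing newline
Run it as python script.py > 2.txt to collect the matching links.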

How to cut a line after a phrase is found?

I have a large text file full of websites visited by hosts. This is the format:
Host : Url
A lot of the urls look like this:
http://google.com/?aslkdfjasldkfjaskldfjalskdjfalksdfjalksdjfa;sdlkfjas;dklfjasdklfjasdklfjasdklfjJUSTABUNCHOFRANDOMSTUFFaslkdjfaslkdfjaklsdfjaklsdjfasdkfjasdfklj
And it is hard to see what the original website is. How can I use grep to only show this:
Host : http://google.com
I've been looking everywhere for a way to cut a line after the delimiter ".com" and can't find a solution. Thank you for your help!
Bonus: I forgot about .net, .org, and the other extensions. This might be a more difficult problem than I thought.
Try this:
grep -oP 'Host : http://[^/]+'
(the [^/]+ matches all characters that are not slashes)
Or if you want to specify .com:
grep -oP 'Host : http://.*?\.com'
Another solution:
cut -d'/' -f1-3
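For example (the input file name is illustrative):
cut -d'/' -f1-3 visited.txt
With / as the delimiter, fields 1 through 3 of a line like Host : http://google.com/?junk are Host : http:, the empty field between the two slashes, and google.com, so the line is cut down to Host : http://google.com.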

ROUGE evaluation method gives zero value

I have set all parameters as described in http://kavita-ganesan.com/rouge-howto. But I get zero values for precision, recall, and F-1. Please help me: what can I do?
If you have set all parameters right and are not getting any errors while running ROUGE, then you are probably making the following mistake when creating your summary files in HTML format: ROUGE does not handle whitespace properly, so
<a name="1">[1]</a> <a href="#1" id= 1>
<a name="1">[1]</a> <a href="#1" id=1>
are not the same. With the first (note the space after id=) you will not be shown any error, but the output will be zero; with the second you get the right output.
Hope this helps.
The settings.xml file should look something like this:
<ROUGE_EVAL version="1.5.5">
  <EVAL ID="1">
    <PEER-ROOT>systems</PEER-ROOT>
    <MODEL-ROOT>models</MODEL-ROOT>
    <INPUT-FORMAT TYPE="SPL" />
    <PEERS>
      <P ID="1">peer.txt</P>
    </PEERS>
    <MODELS>
      <M ID="A">modelA.txt</M>
      <M ID="B">modelB.txt</M>
      <M ID="C">modelC.txt</M>
      <M ID="D">modelD.txt</M>
    </MODELS>
  </EVAL>
</ROUGE_EVAL>
Although your input format type might be different, I find that SPL works for .txt files while SEE is for HTML.
One thing that tripped me up: I had <M ID="A">modelA.txt</M> written as <P ID="A">modelA.txt</P>. ROUGE didn't complain; it just came out with 0 for every value. So keep an eye out for small things like that.

Help with grep in BBEdit

I'd like to grep the following in BBEdit.
Find:
<dc:subject>Knowledge, Mashups, Politics, Reviews, Ratings, Ranking, Statistics</dc:subject>
Replace with:
<dc:subject>Knowledge</dc:subject>
<dc:subject>Mashups</dc:subject>
<dc:subject>Politics</dc:subject>
<dc:subject>Reviews</dc:subject>
<dc:subject>Ratings</dc:subject>
<dc:subject>Ranking</dc:subject>
<dc:subject>Statistics</dc:subject>
OR
Find:
<dc:subject>Social web, Email, Twitter</dc:subject>
Replace with:
<dc:subject>Social web</dc:subject>
<dc:subject>Email</dc:subject>
<dc:subject>Twitter</dc:subject>
Basically, when there's more than one category, I need to find the comma and space, add a line break, and wrap the open/close tags around each category.
Any thoughts?
Wow. Lots of complex answers here. How about find:
,
(there's a space after the comma)
and replace with:
</dc:subject>\r<dc:subject>
Find:
(.+?),\s?
Replace:
\1\r
I'm not sure what you meant by “wrap the open/close around the category”, but if you mean that you want to wrap it in some sort of tag or link, just add it to the replace.
Replace:
\1\r
Would give you
Social web
Email
Twitter
Or get fancier with Replace:
<a href="\1">\1</a>\r
Would give you
<a href="Social web">Social web</a>
<a href="Email">Email</a>
<a href="Twitter">Twitter</a>
In that last example you may have a problem with the “Social web” URL having a space in it. I wouldn't recommend that, but I wanted to show you that you could use the \1 backreference more than once.
The Grep reference in the BBEdit Manual is fantastic. Go to Help->User Manual and then Chapter 8. Learning how to use RegEx well will change your life.
UPDATE
Weird, when I first looked at this it didn't show me your full example. Based upon what I see now, you should:
Find:
(.+?),\s?
Replace:
<dc:subject>\1</dc:subject>\r
I don't use BBEdit, but in Vim you can do this:
%s/<dc:subject>\(\_[^<]\+\)<\/dc:subject>/\=substitute(submatch(0), ",[ \t]*", "<\/dc:subject>\r<dc:subject>", "g")/g
It will handle multiple lines and tags that span content with line breaks. It also handles lines with multiple <dc:subject> elements, but won't always get the newline between the close and start tags.
If you post this to the google group vim_use and ask for a Vim solution and the corresponding perl version of it, you would probably get a bunch of suggestions and something that works in BBEdit and then also outside any editor in perl.
Don
You can use sed to do this as well; in theory you just need to replace ", " with the closing and opening <dc:subject> tags with a newline character in between, and output to a new file. But sed doesn't seem to like the HTML angle brackets... I tried escaping them but still got error messages any time they were included. This is all I had time for so far, so if I get a chance to come back to it I will. Maybe someone else can solve the issue:
sed s/, /</dc:subject>\n<dc:subject>/g file.txt > G:\newfile.txt
OK, I think I figured it out. Basically I had to put the replacement text containing angle brackets in double quotes and change the separator character sed uses to something other than forward slash, since a forward slash appears in the replacement text and sed didn't like it. I don't know much about grep, but I've read that grep just matches things whereas sed will replace, so it's better for this type of thing:
sed s%", "%"</dc:subject>\n<dc:subject>"%g file.txt > newfile.txt
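With GNU sed, quoting the whole script and escaping the slash in the closing tag works too (a sketch, equivalent to the command above):
sed 's/, /<\/dc:subject>\n<dc:subject>/g' file.txt > newfile.txt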
You can't do this via normal grep, but you can add a "Unix Filter" to BBEdit that does the work for you:
#!/usr/bin/perl -w
while (<>) {
    my $line = $_;
    $line =~ /<dc:subject>(.+)<\/dc:subject>/;
    my $content = $1;
    my @arr;
    if ($content =~ /,/) {
        @arr = split(/,/, $content);
    } else {
        @arr = ($content);    # a single category passes through unchanged
    }
    my $newline = '';
    foreach my $part (@arr) {
        $newline .= "\n" if ($newline ne '');
        $part =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;    # trim leading/trailing whitespace
        $newline .= "<dc:subject>$part</dc:subject>";
    }
    print $newline;
}
How to add this Unix Filter to BBEdit is described in the "Installation" part of this URL: http://blog.elitecoderz.net/windows-zeichen-fur-mac-konvertieren-und-umgekehrt-filter-fur-bbeditconverting-windows-characters-to-mac-and-vice-versa-filter-for-bbedit/2009/01/
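(To try the filter outside BBEdit, save it as, say, dcsubject_filter.pl and run perl dcsubject_filter.pl < subjects.xml; the file names here are illustrative.)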

How can I programmatically get the image on this page?

The URL http://www.fourmilab.ch/cgi-bin/Earth shows a live map of the Earth.
If I issue this URL in my browser (FF), the image shows up just fine. But when I try 'wget' to fetch the same page, I fail!
Here's what I tried first:
wget -p http://www.fourmilab.ch/cgi-bin/Earth
Thinking that all the other form fields were probably required too, I did a 'View Source' on the above page, noted down the various field values, and then issued the following command:
wget --post-data "opt=-p&lat=7°27'&lon=50°49'&ns=North&ew=East&alt=150889769&img=learth.evif&date=1&imgsize=320&daynight=-d" http://www.fourmilab.ch/cgi-bin/Earth
Still no image!
Can someone please tell me what is going on here...? Are there any 'gotchas' with CGI and/or form-POST based wgets? Where (book or online resource) would such concepts be explained?
If you inspect the page's source code, there's an img element that contains the image of the Earth. For example:
<img
src="/cgi-bin/Earth?di=570C6ABB1F33F13E95631EFF088262D5E20F2A10190A5A599229"
ismap="ismap" usemap="#zoommap" width="320" height="320" border="0" alt="" />
Without giving the 'di' parameter, you are just asking for the whole web page, with references to this image, not for the image itself.
Edit: the 'di' parameter encodes which "part" of the Earth you want to receive; anyway, try for example:
wget http://www.fourmilab.ch/cgi-bin/Earth?di=F5AEC312B69A58973CCAB756A12BCB7C47A9BE99E3DDC5F63DF746B66C122E4E4B28ADC1EFADCC43752B45ABE2585A62E6FB304ACB6354E2796D9D3CEF7A1044FA32907855BA5C8F
Use GET instead of POST. They're completely different for the CGI program in the background.
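For example, a GET request simply carries the same form fields in the query string (a sketch using the fields from the question; the lat/lon values contain degree signs and would need URL-encoding, so they are omitted here):
wget -O earth.html "http://www.fourmilab.ch/cgi-bin/Earth?opt=-p&ns=North&ew=East&alt=150889769&img=learth.evif&date=1&imgsize=320&daynight=-d"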
Following on from Ravadre,
wget -p http://www.fourmilab.ch/cgi-bin/Earth
downloads an XHTML file which contains an <img> tag.
I edited the XHTML to remove everything but the img tag and turned it into a bash script containing another wget -p command, escaping the ? and =.
When I executed this I got a 14 kB file, which I renamed earth.jpg.
Not really programmatic, the way I did it, but I think it could be done.
But as @somedeveloper said, the di value is changing (since it depends on time).
Guys, here's what I finally did. I'm not fully happy with this solution, as I was (and still am) hoping for a better way: one that gets the image on the first wget itself, giving me the same experience I get when browsing via Firefox.
#!/bin/bash
# Scrape the current <img src="/cgi-bin/Earth?di=..."> URL out of the page, then fetch that image.
tmpf=/tmp/delme.jpeg
base=http://www.fourmilab.ch
liveurl=$(wget -O - $base/cgi-bin/Earth?opt=-p 2>/dev/null | perl -0777 -nle 'if(m#<img \s+ src \s* = \s* "(/cgi-bin/Earth\?di= .*? )" #gsix) { print "$1\n" }' )
wget -O $tmpf $base$liveurl &>/dev/null
What you are downloading is the whole HTML page, not the image. To download the image and the other page elements too, you'll need to use the --page-requisites (and possibly --convert-links) parameter(s). One catch: because robots.txt disallows access to URLs under /cgi-bin/, wget will skip the image located under /cgi-bin/ unless you disable the robots protocol with -e robots=off.
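For example (sketch):
wget -e robots=off --page-requisites --convert-links http://www.fourmilab.ch/cgi-bin/Earth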
