I have a large text file full of websites visited by hosts. This is the format:
Host : Url
A lot of the urls look like this:
http://google.com/?aslkdfjasldkfjaskldfjalskdjfalksdfjalksdjfa;sdlkfjas;dklfjasdklfjasdklfjasdklfjJUSTABUNCHOFRANDOMSTUFFaslkdjfaslkdfjaklsdfjaklsdjfasdkfjasdfklj
And it is hard to see what the original website is. How can I use grep to only show this:
Host : http://google.com
I've been looking everywhere for a way to cut a line after the delimiter ".com" and can't find a solution. Thank you for your help!
Bonus: I forgot about .net, .org, and the other extensions. This might be a more difficult problem than I thought.
Try this:
grep -oP 'Host : http://[^/]+'
(the [^/]+ part matches all characters that are not slashes)
or, if you want to stop specifically at .com:
grep -oP 'Host : http://.*?\.com'
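For example, on a line shaped like the ones in your file (shortened here), the first command prints exactly the part you want:

echo 'Host : http://google.com/?aslkdfjasldkfj' | grep -oP 'Host : http://[^/]+'
Host : http://google.com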
Another solution:
cut -d'/' -f1-3
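This works because, splitting on /, fields 1-3 are everything up to the end of the hostname, and unlike the \.com pattern it also covers your .net/.org bonus case. On the same shortened sample:

echo 'Host : http://google.com/?aslkdfjasldkfj' | cut -d'/' -f1-3
Host : http://google.com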
I'm looking for a script (or if there isn't one, I guess I'll have to write my own).
I wanted to ask if anyone here knows a script that can take a txt file with n links (let's say 200). I need to extract only links that have particular characters in them; let's say I only need links that contain "/r/learnprogramming". I need the script to get those links and write them to another txt file.
Edit: Here is what helped me: grep -i "/r/learnprogramming" 1.txt >2.txt
You can use Ajax to read the .txt file with jQuery:
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.1.1/jquery.min.js"></script>
<script>
jQuery(function($) {
    console.log("start");
    $.get("https://ayulayol.imfast.io/ajaxads/ajaxads.txt", function(wholeTextFile) {
        var lines = wholeTextFile.split(/\n/),
            randomIndex = Math.floor(Math.random() * lines.length),
            randomLine = lines[randomIndex];
        console.log(randomIndex, randomLine);
        $("#ajax").html(randomLine.replace(/#/g, "<br>"));
    });
});
</script>
<div id="ajax"></div>
If you are using Linux or macOS, you can use cat and grep to output the links.
cat in.txt | grep /r/learnprogramming > out.txt
Since you did not provide the exact format of the document, I assume the links are separated by newline characters. In this case the code is pretty straightforward in Python/awk, since you can iterate over the lines of the file and print only those that match your pattern (either by checking pattern in line or by using a regex if the pattern is more complex). To store the links in a new file, simply redirect stdout to a new file like this:
python script.py > links.txt
The approach above works even if the links are separated by an arbitrary symbol s: first read the file into a single string, then split it on s. I hope this helps.
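A minimal sketch of that approach in Python (the filename and pattern are taken from this thread; adjust them to your data):

# script.py: print only the lines that contain the pattern
pattern = "/r/learnprogramming"  # pattern from the question

with open("1.txt") as f:         # input file from the question
    for line in f:
        if pattern in line:      # use re.search() instead for complex patterns
            print(line, end="")  # keep the line's own newline

Running python script.py > links.txt then stores the matching links in links.txt.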
I wrote this grep command:
grep -- "^[0-9a-zA-Z\.-]\+$" file.txt
To get all lines containing only letters, digits, dots, and dashes (legal domain names).
This is the result of diff on both files:
1,3c1,3
< test.xcom
< hi-th6ere.co.k
< 54
---
> test.xcom
> hi-th6ere.co.k
> 54
I wrote a file with some domains to test and it works great!
But when I download a file (with the same content!) from the web and then run this command, grep doesn't return anything.
I've tried to set full permissions on this file, but it still doesn't work.
Any ideas?
Thanks,
What makes you think the file content is the same as the one you've tested?
You can run 'diff filename1 filename2' to see if there are any differences between the two files.
It could be that the file you're downloading is in Unicode format, so in a web browser it appears to have the same content as the file you've tested, but the binary content of the file itself is different.
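A common culprit here (an assumption, but it fits the symptom of a $-anchored pattern failing) is that the downloaded file has Windows-style CRLF line endings: the invisible \r before each line break is not in your bracket expression, so no line can match. You can check for and strip it like this (downloaded.txt stands in for your file):

file downloaded.txt                       # GNU file reports "with CRLF line terminators"
cat -A downloaded.txt                     # a CRLF line ends in ^M$
tr -d '\r' < downloaded.txt > fixed.txt   # remove the carriage returns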
I want to be able to encode a path for use as a URL, i.e. change spaces to %20. I found this function, which does the encoding:
urlencode() {
    setopt localoptions extendedglob
    input=( ${(s::)1} )
    print ${(j::)input/(#b)([^A-Za-z0-9_.\!~*\'\(\)- ])/%${(l:2::0:)$(([##16]#match))}}
}
and want to be able to pass the results of this:
print -l $PWD/* | tail -1
to the function, i.e. get the last full path in the file listing and encode it.
I thought that something like this:
print -l $PWD/* | tail -1 | urlencode
or
print -l $PWD/* | tail -1 > urlencode
would work but they don't.
Does anyone know how to accomplish it?
Many Thanks
You need to get your input from stdin rather than from the first argument.
Here is one way to adapt the function to do this:
urlencode() {
    setopt localoptions extendedglob
    # collect everything from stdin instead of using the first argument
    stdin=`while read line; do echo $line ;done`
    input=( ${(s::)stdin} )
    print ${(j::)input/(#b)([^A-Za-z0-9_.\!~*\'\(\)- ])/%${(l:2::0:)$(([##16]#match))}}
}
I tested it in my terminal and it works.
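With that change, the pipeline from the question works as intended:

print -l $PWD/* | tail -1 | urlencode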
Fairly regularly, I need to replace a local URL with a live one in large WordPress databases. I can do it in TextMate, but it often takes 10+ minutes to complete.
Basically, I have a 10MB+ .sql file and I want to:
Find: http://localhost:8888/mywebsite
and
Replace with: http://mywebsite.com
After that, I'll save the file and do a mysql import to the local/live servers. I do this at least 3-4 times a week, and waiting for TextMate has been a pain. Is there an easier/faster way to do this with grep/sed/awk?
Thanks!
Terry
sed 's/http:\/\/localhost:8888\/mywebsite/http:\/\/mywebsite.com/g' FileToReadFrom > FileToWriteTo
This runs the substitute command (s/) globally (/g), replacing the first URL with the second. The forward slashes are escaped with backslashes.
kent$ echo "foobar||http://localhost:8888/mywebsite||fooooobaaaaaaar"|sed 's#http://localhost:8888/mywebsite#http://mywebsite.com#g'
foobar||http://mywebsite.com||fooooobaaaaaaar
If you want to do the replacement in place (changing your original file):
sed -i 's#http://.....#http://mysite#g' input.sql
You don't need to replace the http://
sed "s/localhost:8888/www.my-awesome-page.com/g" input.sql > output.sql
I'd like to grep the following in BBEdit.
Find:
<dc:subject>Knowledge, Mashups, Politics, Reviews, Ratings, Ranking, Statistics</dc:subject>
Replace with:
<dc:subject>Knowledge</dc:subject>
<dc:subject>Mashups</dc:subject>
<dc:subject>Politics</dc:subject>
<dc:subject>Reviews</dc:subject>
<dc:subject>Ratings</dc:subject>
<dc:subject>Ranking</dc:subject>
<dc:subject>Statistics</dc:subject>
OR
Find:
<dc:subject>Social web, Email, Twitter</dc:subject>
Replace with:
<dc:subject>Social web</dc:subject>
<dc:subject>Email</dc:subject>
<dc:subject>Twitter</dc:subject>
Basically, when there's more than one category, I need to find the comma and space, add a line break, and wrap each category in its own opening/closing tags.
Any thoughts?
Wow. Lots of complex answers here. How about find:
,
(there's a space after the comma)
and replace with:
</dc:subject>\r<dc:subject>
Find:
(.+?),\s?
Replace:
\1\r
I'm not sure what you meant by “wrap the open/close around the category”, but if you mean you want to wrap it in some sort of tag or link, just add it to the replace.
Replace:
\1\r
Would give you
Social web
Email
Twitter
Or get fancier with Replace:
<a href="\1">\1</a>\r
Would give you
<a href="Social web">Social web</a>
<a href="Email">Email</a>
<a href="Twitter">Twitter</a>
In that last example you may have a problem with the “Social web” URL having a space in it. I wouldn't recommend that, but I wanted to show you that you could use the \1 backreference more than once.
The Grep reference in the BBEdit Manual is fantastic. Go to Help->User Manual and then Chapter 8. Learning how to use RegEx well will change your life.
UPDATE
Weird, when I first looked at this it didn't show me your full example. Based upon what I see now, you should:
Find:
(.+?),\s?
Replace:
<dc:subject>\1</dc:subject>\r
I don't use BBEdit, but in Vim you can do this:
%s#<dc:subject>\(\_[^<]\+\)</dc:subject>#\=substitute(submatch(0), ",[ \t]*", "</dc:subject>\r<dc:subject>", "g")#g
It will handle multiple lines and tags that span content with line breaks. It handles lines with multiple <dc:subject> tags too, but won't always get the newline between the close and start tag.
If you post this to the google group vim_use and ask for a Vim solution and the corresponding perl version of it, you would probably get a bunch of suggestions and something that works in BBEdit and then also outside any editor in perl.
Don
You can use sed to do this too; in theory you just need to replace ", " with the closing and opening <dc:subject> tags with a newline character in between, and output to a new file. But sed doesn't seem to like the HTML angle brackets... I tried escaping them but still got error messages whenever they were included. This is all I had time for so far, so if I get a chance to come back to it I will. Maybe someone else can solve the angle-bracket issue:
sed s/, /</dc:subject>\n<dc:subject>/g file.txt > G:\newfile.txt
OK, I think I figured it out. Basically I had to put the replacement text containing angle brackets in double quotes (otherwise the shell treats < and > as redirections) and change the separator character sed uses to something other than a forward slash, as that appears in the replacement text and sed didn't like it. I don't know much about grep, but I read that grep just matches things whereas sed will replace, so it is better for this type of thing:
sed s%", "%"</dc:subject>\n<dc:subject>"%g file.txt > newfile.txt
You can't do this via normal grep, but you can add a "Unix Filter" to BBEdit that does the work for you:
#!/usr/bin/perl -w
while (<>) {
    my $line = $_;
    # pass lines without a <dc:subject> tag through unchanged
    unless ($line =~ /<dc:subject>(.+)<\/dc:subject>/) {
        print $line;
        next;
    }
    my $content = $1;
    my @arr;
    if ($content =~ /,/) {
        @arr = split(/,/, $content);
    } else {
        @arr = ($content);    # single category: keep it as one tag
    }
    my $newline = '';
    foreach my $part (@arr) {
        $newline .= "\n" if ($newline ne '');
        $part =~ s/^\s*(\S*(?:\s+\S+)*)\s*$/$1/;    # trim surrounding whitespace
        $newline .= "<dc:subject>$part</dc:subject>";
    }
    print "$newline\n";
}
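You can try the filter from the command line before wiring it into BBEdit (filter.pl stands in for whatever name you save it under):

perl filter.pl < in.xml > out.xml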
How to add this Unix filter to BBEdit is explained in the "Installation" part of this URL: http://blog.elitecoderz.net/windows-zeichen-fur-mac-konvertieren-und-umgekehrt-filter-fur-bbeditconverting-windows-characters-to-mac-and-vice-versa-filter-for-bbedit/2009/01/