I am trying to download a site (with permission) and grep for a particular piece of text in it. The problem is that I want to grep on the fly, without saving any files to the local drive. The following command does not help.
wget --mirror site.com -O - | grep TEXT
The wget manual (man page) says the usage of the command should be:
wget [option]... [URL]...
In your case, it should be:
wget --mirror -O - site.com | grep TEXT
You can use curl:
curl -s http://www.site.com | grep TEXT
How about this:
wget -qO- site.com | grep TEXT
and
curl -vs site.com 2>&1 | grep TEXT
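If you really need the recursive behaviour of --mirror from the question rather than a single page, a rough, untested variant is to stream a shallow recursive crawl straight into grep; note that the wget manual cautions that combining -r with -O puts all downloaded content into a single output and may not behave the way you expect:
# Untested sketch: shallow recursive crawl streamed straight into grep,
# nothing written to disk (the depth and URL are placeholders).
wget -r -l1 -qO- http://site.com | grep TEXT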
Is there a way to check whether a file exists inside a zip file without unzipping it? I'm using Artifactory, and with curl alone I can't do it. Can you advise me?
I tried the following:
sh(script: "curl -o /dev/null -sf <artifactory url>")
This always returns success,
and I also tried:
unzip -l <file.zip> | grep -q <file name>
but this requires unzip to be installed.
From Artifactory 7.15.3, as mentioned on this page, archive search is disabled by default. Can you confirm whether you are on a version above this? If yes, you can enable the feature by navigating to Admin > Artifactory > Archive Search Enabled and ticking that checkbox. Be aware, though, that enabling this feature writes a lot of information to the database for every archive file and can impact performance.
Afterwards you can search for items inside a zip file. Below is an example command where I search for .class files in my zip via curl. You can do something similar in Jenkins.
$ curl -uadmin:password -X GET "http://localhost:8082/artifactory/api/search/archive?name=*.class&repos=test" -H "Content-type: application/json"
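As a rough sketch (the host, credentials, repository and entry name below are placeholders, and it simply greps the raw JSON response rather than relying on its exact structure), you could turn that API call into a pass/fail check for a Jenkins sh step:
# Hypothetical check: succeed only if the archive search response mentions the entry we need.
curl -s -uadmin:password \
  "http://localhost:8082/artifactory/api/search/archive?name=MyClass.class&repos=test" \
  -H "Content-type: application/json" \
  | grep -q "MyClass.class" && echo "found" || echo "not found"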
You can make use of the unzip and grep command-line utilities.
unzip -l my_file.zip | grep -q file_to_search
# Example
$ unzip -l 99bottles.zip | grep -q 99bottles.pdf && echo $?
0
P.S. If the zip contains a directory structure, then grep for the full path of the file name.
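For instance, assuming the PDF sits under a docs/ folder inside the archive (a made-up layout, just for illustration):
unzip -l 99bottles.zip | grep -q 'docs/99bottles.pdf' && echo "present"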
I am trying to find a way to make these work together. I can run this successfully using Wget for Windows:
wget --html-extension -r http://www.sitename.com
This downloads every single file on my server that is linked from the root domain. I'd rather download only the pages in my sitemap. For this, I found the following trick, which uses Cygwin:
wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
"http://www\.sitename\.com[^<]+" | wget --spider -i - --wait 1
However, this only checks that the pages exist; it does not download them as static HTML files the way the earlier wget command does.
Is there a way to merge these and download the sitemap pages as local HTML files?
If you look at the man page for wget, you will see that the --spider entry is as follows:
--spider
When invoked with this option, Wget will behave as a Web spider, which means that it will not download the pages, just check that they are there.
All you need to do to actually download the files is remove --spider from your command.
wget --quiet https://www.sitename.com/sitemap.xml --output-document - | egrep -o \
"https?://www\.sitename\.com[^<]+" | wget -i - --wait 1
If I have a document with many links and I want to download one particular picture, named www.website.de/picture/example_2015-06-15.jpeg, how can I write a command that automatically downloads exactly the one I extracted from my document?
My idea would be this, but I get a failure message like "wget: URL is missing":
grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document | wget
Use xargs:
grep etc... | xargs wget
xargs takes its stdin (grep's output) and passes that text as command-line arguments to whatever application you tell it to run.
For example,
echo hello | xargs echo 'from xargs '
produces:
from xargs hello
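Applied to the example from the question, the pipeline would look roughly like this; the -o flag makes grep emit only the matched URL rather than the whole line containing it:
grep -oE 'www\.website\.de/picture/example_2015-06-15\.jpeg' document | xargs wget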
Using backticks would be the easiest way of doing it:
wget `grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document`
This will do too:
wget "$(grep -E 'www.website.de/picture/example_2015-06-15.jpeg' document)"
We are trying to run the following command:
wget -O xyz.xls --user=COldPolar --password=GlacierICe --ignore-length=on "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000"
This returns a 3 KB output.
If we open IE and use the following URL, we can save a 1.7 MB file: "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000". Please advise how to get wget to work.
If you can use cURL you can do:
curl -o xyz.xls -u COldPolar:GlacierICe 'http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000'
The way I managed to get wget to work was to do this first:
wget --save-cookies cookies.txt --post-data 'os_username=COldPolar&os_password=GlacierICe&os_cookie=true' http://Colder.near.com:8080/login.jsp
And then:
wget -O xyz.xls --load-cookies cookies.txt "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000"
One of the things to watch out for is a self-signed certificate. You can rule this out by running with --no-check-certificate.
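For completeness, a hedged curl equivalent of the same two-step cookie approach (same placeholder host, credentials and query as above) would be:
# Log in and save the session cookie, then download the report using it.
curl -c cookies.txt -d 'os_username=COldPolar&os_password=GlacierICe&os_cookie=true' \
     http://Colder.near.com:8080/login.jsp
curl -b cookies.txt -o xyz.xls \
     "http://Colder.near.com:8080/sr/jira.issueviews:searchrequest-excel-current-fields/temp/SearchRequest.xls?&runQuery=true(jqlQuery=project%3DCCD)&tempMax=1000"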
I'm looking for a way to pseudo-spider a website. The key is that I don't actually want the content, but rather a simple list of URIs. I can get reasonably close to this idea with Wget using the --spider option, but when piping that output through a grep, I can't seem to find the right magic to make it work:
wget --spider --force-html -r -l1 http://somesite.com | grep 'Saving to:'
The grep filter seems to have absolutely no effect on the wget output. Have I got something wrong, or is there another tool I should try that's more geared towards providing this kind of limited result set?
UPDATE
So I just found out offline that, by default, wget writes to stderr. I missed that in the man pages (in fact, I still haven't found it, if it's in there). Once I redirected stderr to stdout, I got closer to what I need:
wget --spider --force-html -r -l1 http://somesite.com 2>&1 | grep 'Saving to:'
I'd still be interested in other/better means for doing this kind of thing, if any exist.
The absolute last thing I want to do is download and parse all of the content myself (i.e. create my own spider). Once I learned that Wget writes to stderr by default, I was able to redirect it to stdout and filter the output appropriately.
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
> urls.m3u
This gives me a list of the content resource URIs (resources that aren't images, CSS, or JS source files) that are spidered. From there, I can send the URIs off to a third-party tool for processing to meet my needs.
The output still needs to be streamlined slightly (it produces duplicates, as shown above), but it's almost there and I haven't had to do any parsing myself.
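One way to drop those duplicates is simply to append sort -u to the same pipeline before writing the list:
wget --spider --force-html -r -l2 $url 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\)$' \
| sort -u \
> urls.m3u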
Create a few regular expressions to extract the addresses from all
<a href="(ADDRESS_IS_HERE)">.
Here is the solution I would use:
wget -q http://example.com -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"\(ht\|f\)tps\?:[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g'
This will output all http, https, ftp, and ftps links from a webpage. It will not give you relative URLs, only full URLs.
Explanation regarding the options used in the series of piped commands:
wget -q suppresses excessive output (quiet mode).
wget -O - makes it so that the downloaded file is echoed to stdout rather than saved to disk.
tr is the Unix character translator, used in this example to translate newlines and tabs to spaces, as well as to convert single quotes into double quotes so we can simplify our regular expressions.
grep -i makes the search case-insensitive.
grep -o makes it output only the matching portions.
sed is the Stream EDitor Unix utility, which allows for filtering and transformation operations.
sed -e just lets you feed it an expression.
Running this little script on "http://craigslist.org" yielded quite a long list of links:
http://blog.craigslist.org/
http://24hoursoncraigslist.com/subs/nowplaying.html
http://craigslistfoundation.org/
http://atlanta.craigslist.org/
http://austin.craigslist.org/
http://boston.craigslist.org/
http://chicago.craigslist.org/
http://cleveland.craigslist.org/
...
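The pipeline above drops relative links. A rough, untested variation that also keeps hrefs starting with / and prefixes them with the base URL (anchors, mailto: links, and paths without a leading slash are still ignored; BASE is a placeholder) might look like this:
BASE="http://example.com"
wget -q "$BASE" -O - | \
tr "\t\r\n'" ' "' | \
grep -i -o '<a[^>]\+href[ ]*=[ \t]*"[^"]\+"' | \
sed -e 's/^.*"\([^"]\+\)".*$/\1/g' | \
sed -e "s|^/|$BASE/|"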
I've used a tool called xidel
xidel http://server -e '//a/@href' |
grep -v "http" |
sort -u |
xargs -L1 -I {} xidel http://server/{} -e '//a/@href' |
grep -v "http" | sort -u
A little hackish, but it gets you closer! This is only the first level. Imagine packing this up into a self-recursive script!
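A very rough sketch of what that self-recursive packaging could look like (untested; BASE, the depth limit, and the relative-link handling are all placeholder assumptions):
#!/usr/bin/env bash
# Crawl relative links up to MAX_DEPTH levels deep, printing every href found.
BASE="http://server"
MAX_DEPTH=2

crawl() {
    local path="$1" depth="$2"
    [ "$depth" -gt "$MAX_DEPTH" ] && return
    xidel "$BASE/$path" -e '//a/@href' \
        | grep -v "http" | sort -u \
        | while read -r link; do
            echo "$link"
            crawl "$link" "$((depth + 1))"
        done
}

crawl "" 0 | sort -u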