Get URLs from a remote page and then download to txt file

I've tried lots of suggestions but can't find a solution (I don't know if it's even possible). I'm using the terminal on Ubuntu 15.04.
I need to download into a text file all internal and external links from mywebsite.com whose names start with links_, for example http://www.mywebsite.com/links_sony.aspx. I don't need any of the other links, e.g. mywebsite.com/index.aspx or conditions.asp. I use
wget --spider --recursive --no-verbose --output-file="links.csv" http://www.mywebsite.com
Can you help me please? Thanks in advance

If you don't mind using a couple of other tools to coax wget, then you can try this bash script that employs awk, grep, wget and lynx:
#!/bin/bash
# dump the page as text, extract the URLs, and keep only those matching the filter
lynx -dump "$1" | awk '/http/{print $2}' | grep "$2" > /tmp/urls.txt
# echo the list (so it can be redirected to a file), then download each URL
cat /tmp/urls.txt
while read -r url; do wget "$url"; done < /tmp/urls.txt
Save the above script as getlinks, make it executable (chmod +x getlinks), and then run it as
./getlinks 'http://www.mywebsite.com' 'links_' > mycollection.txt
so that the list of matching URLs is saved in mycollection.txt.
This approach doesn't need many extra tools; it reuses commonly available ones.
You may have to adjust the quoting depending on which shell you use. The above works in standard bash and doesn't depend on specific versions of these tools.
You can customize the download step
wget "$url"
with appropriate switches to meet your specific needs, such as recursive, spider, or verbosity options. Insert those switches between wget and "$url".
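For instance, to make the loop behave like the wget call from the question (spider mode, recursion, quiet logging), the last line of the script might become the following; this is just a sketch, and the right switches depend on what you actually want to collect:
# hypothetical variant: spider each URL recursively instead of saving it
while read -r url; do wget --spider --recursive --no-verbose "$url"; done < /tmp/urls.txt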

Related

Grep with RegEx Inside a Docker Container?

Dipping my toes into Bash coding for the first time (not the most experienced person with Linux either) and I'm trying to read the version from the version.php inside a container at:
/config/www/nextcloud/version.php
To do so, I run:
docker exec -it 1c8c05daba19 grep -eo "(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+([0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?" /config/www/nextcloud/version.php
This uses a semantic versioning RegEx pattern (I know, a bit overkill, but it works for now) to read and extract the version from the line:
$OC_VersionString = '20.0.1';
However, when I run the command it tells me No such file or directory (I've confirmed the file does exist at that path inside the container), and then it proceeds to spit out the entire contents of the file it just said doesn't exist:
grep: (0|[1-9]\d*).(0|[1-9]\d*).(0|[1-9]\d*)(?:-((?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-])(?:.(?:0|[1-9]\d|\d*[a-zA-Z-][0-9a-zA-Z-]))))?(?:+([0-9a-zA-Z-]+(?:.[0-9a-zA-Z-]+)*))?: No such file or directory
/config/www/nextcloud/version.php:$OC_Version = array(20,0,1,1);
/config/www/nextcloud/version.php:$OC_VersionString = '20.0.1';
/config/www/nextcloud/version.php:$OC_Edition = '';
/config/www/nextcloud/version.php:$OC_VersionCanBeUpgradedFrom = array (
/config/www/nextcloud/version.php: 'nextcloud' =>
/config/www/nextcloud/version.php: 'owncloud' =>
/config/www/nextcloud/version.php:$vendor = 'nextcloud';
Anyone able to spot the problem?
Update 1:
For the sake of clarity, I'm trying to run this from a bash script. I just want to fetch the version number from that file, to use it in other areas of the script.
Update 2:
Responding to the comments, I tried logging in to the container first and then running the grep, and still got the same result. Then I cat that file and it shows its contents no problem.
Many containers don't have the GNU versions of Unix tools and their various extensions. It's popular to base containers on Alpine Linux, which in turn uses a very lightweight single-binary tool called BusyBox to provide the base tools. Those tend to have the set of options required in the POSIX specs, and no more.
POSIX grep(1) in particular doesn't have an -o option. So the command you're running is
grep \
    -eo \                             # -e o: specify "o" as the regexp to match
    "(regexps are write-only)" \      # a filename
    /config/www/nextcloud/version.php # a second filename
Notice that the grep output in the interactive shell only contains lines with the letter "o", but not for example the line just containing array.
POSIX grep doesn't have an equivalent for GNU grep's -o option
Print only the matched (non-empty) parts of matching lines, with each such part on a separate output line. Output lines use the same delimiters as input....
but it's easy to do that with sed(1) instead. Ask it to match some stuff, the regexp in question, and some stuff, and replace it with the matched group.
sed -e 's/.*\(any regexp here\).*/\1/' input-file
(POSIX sed only accepts basic regular expressions, so you'll have to escape more of the parentheses.)
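Applied to this case, a minimal sketch (assuming the same container ID and file path as above) could pull out just the version number with POSIX sed and a basic regular expression:
# print only the quoted x.y.z version from version.php
docker exec 1c8c05daba19 sed -n \
    "s/.*OC_VersionString = '\([0-9]*\.[0-9]*\.[0-9]*\)'.*/\1/p" \
    /config/www/nextcloud/version.php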
Well, for any potential future readers: I had no luck getting grep to do it (I'm sure it was my fault somehow and not grep's), but thanks to the help in this post I was able to use awk instead of grep, like so:
docker exec -it 1c8c05daba19 awk '/^\$OC_VersionString/ && match($0,/\047[0-9]+\.[0-9]+\.[0-9]+\047/){print substr($0,RSTART+1,RLENGTH-2)}' /config/www/nextcloud/version.php
That ended up doing exactly what I needed:
It logs into a docker container.
Scans and returns just the version number from the line I am looking for at: /config/www/nextcloud/version.php inside the container.
Exits stage left from the container with just the info I needed.
I can get right back to eating my Hot Cheetos.

Get similar links from one site using wget

I have a site (http://a-site.com) with many links like that (they all end in /follow_user). How can I use wget to crawl and collect all similar links into a file?
I tried this, but the command only gets the similar links on one page; it doesn't recursively follow other links to find more.
$ wget -erobots=off --no-verbose -r --quiet -O - http://a-site.com 2>&1 | \
grep -o '['"'"'"][^"'"'"']*/follow_user['"'"'"]'
You may want to use the --accept-regex option of wget rather than piping through grep:
wget -r --accept-regex '['"'"'"][^"'"'"']*/follow_user['"'"'"]' http://a-site.com
(Not tested; the regex may need adjustment or an explicit --regex-type (see man wget), and of course add any other options you find useful.)
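As a fuller sketch (untested, like the suggestion above): --accept-regex is matched against the complete URL rather than against the HTML source, so the pattern can target the URL path directly; be aware that filtering too strictly can also stop wget from downloading the pages it needs to recurse through.
# hypothetical invocation: recurse, but only accept URLs containing /follow_user
wget -r -erobots=off --no-verbose \
    --regex-type posix \
    --accept-regex '.*/follow_user.*' \
    http://a-site.com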

How to get fully working grep in git bash (msysgit) on windows?

I would like to use grep -o, but in git bash there is no -o option. Is there a way to get a fully working grep in git bash, just like the one in a Linux bash shell?
There is no -o flag in POSIX grep:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/grep.html
You can use sed instead.
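For example, to emulate grep -o PATTERN with sed (a sketch; unlike grep -o it extracts one match per line rather than every match, which is often enough):
# print only the matched part of each matching line, roughly like grep -o
sed -n 's/.*\(PATTERN\).*/\1/p' file.txt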
There is an open issue for that on Github (even though it's under "nvm"). User UltCombo posted a workaround. Quoting:
Open <Git install directory>/bin and overwrite grep.exe with a more up to date version. I found two alternatives that provide -o support:
GnuWin32's grep 2.5.4 (link).
ezwinports' grep 2.10 (link). Note: you also have to extract libpcre-0.dll into the same folder as grep.
Though ezwinports' grep port is much more up to date, I can't say whether either of these will cause stability/compatibility issues. I haven't found any yet, but use them at your own risk.
Marking this Community Wiki because it's really somebody else's work.
Alternatively, get the pretty awesome MSYS2 and enjoy full grep and co.

Why is PHP CodeSniffer Freezing?

I'm a Junior Programmer where I work. Our website was written using PHP 4. We're migrating from PHP 4 to PHP 5.3. There are roughly 5000 PHP files in around 595 directories. So, as you can imagine, the scope of this project is pretty huge.
We use Subversion for version control. I have two separate checkouts. I have two VMs that act as separate webhosts - one stack emulates our actual webserver (CentOS 4, PHP4, etc) and the other is a PHP 5.3 stack (Ubuntu 12.04 LTS).
I took the time to check the files for basic syntax errors using the following commands:
Edit: I ran the following recursive searches from the root of the website.
find ./ -type f -name \*.php -exec php -l {} \; > ~/php5_basic_syntax_assessment.txt
find ./ -type f -name \*.inc -exec php -l {} \; > ~/php5_basic_syntax_inc_assessment.txt
I realize that using php -l to check basic syntax doesn't reveal deprecated code structures/functions and doesn't provide warnings (e.g., use preg_split() instead of the deprecated split()). Therefore, I decided to install PHP CodeSniffer.
First, I installed PEAR: [I accepted all the default parameters]
cd ~/
mkdir pear
cd pear
wget http://pear.php.net/go-pear.phar
php go-pear.phar
Next, I installed git:
cd ~/
sudo apt-get update
sudo apt-get install git
Next, I installed PHP Code Sniffer
pear install PHP_CodeSniffer
Finally, I installed the following PHP 5.3 Compatibility standards for the PHP Code Sniffer:
git clone git://github.com/wimg/PHP53Compat_CodeSniffer.git PHP53Compatibility
I did all of the above so that I could assess the 5K PHP files in an automated way. It would be extremely tedious and time-consuming to go through each file manually to make sure it follows the PHP 5.3 coding standards.
Finally, here's the command I used to run the PHP Code Sniffer:
phpcs --standard=/home/my_user_name/PHP53Compatibility -p --report-file=/home/my_user_name/php53_assessment.txt /path/to/web/root
To make sure that the specific standards aren't the problem, I also ran the PHP Code Sniffer using the default standards:
phpcs -p --report-file=/home/my_user_name/php53_assessment.txt /path/to/web/root
Either way, the reports freeze in the same place. I've been awake for over 24 hours. I waited for 18 hours before stopping the first run by using CTRL+C. The second is still running and has been running for about an hour and a half.
So, what is causing my PHP Code Sniffer to freeze?
All help is very much appreciated.
Bit late, but I ran into the same issue. Limiting the run to just PHP files should do the trick: phpcs -p -- ./**/*.php
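Note that the ** glob only recurses when the shell enables it; in bash it is off by default (a sketch assuming bash 4 or later):
# enable recursive globbing, then lint only the .php files
shopt -s globstar
phpcs -p -- ./**/*.php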

find a command on $PATH

I'm writing a script, and I need to look up a command on the user's $PATH and get the full path to the command. The problem is that I don't know what the user's login shell is, or what strange stuff might be in their dot files. I'm using the Bourne shell for my simple little script because it needs to run on some older Solaris platforms that might not have bash.
Some implementations of "which" and "whence" will source the user's dot files, and that isn't really portable to all users. I'd love a simple UNIX utility that would just do the basic job of scanning PATH for an executable and reporting the full path of the first match.
But I'll settle for any /bin/sh solution that is stable for all users.
I'm looking for a solution that is better than writing my own /bin/sh loop that chops up $PATH and searches it one element at a time. It would seem that this is common enough that there should be a reusable way to do it.
My first approximation of the "long way" is this:
IFS=:
for i in $PATH; do
    if [ -x "$i/$cmd" ]; then
        echo "$i/$cmd"
        break   # stop at the first match
    fi
done
Is there something simpler and portable?
The answer seems to be the 'type' built-in.
% /bin/sh
$ type ls
ls is /bin/ls
Maybe the whereis command will work for you?
whereis -b -B `echo $PATH | sed 's/:/ /g'` -f [commands]
e.g. on my computer, this works:
whereis -b -B `echo $PATH | sed 's/:/ /g'` -f find man fsc
And results in:
find: /usr/bin/find
man: /usr/bin/man
fsc: /opt/FSharp-2.0.0.0/bin/fsc.exe /opt/FSharp-2.0.0.0/bin/fsc
One caveat from the whereis man page:
Since whereis uses chdir(2V) to run faster, pathnames given with the -M, -S, or -B must be full; that is, they must begin with a `/'.
This question is answered in detail here: https://unix.stackexchange.com/questions/85249/why-not-use-which-what-to-use-then. Bottom line: use command -v ls.
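In script form, a minimal portable sketch (POSIX sh; cmd holds whatever command name you're resolving):
# resolve a command name to a full path; fail if it isn't found
cmd=ls
path=$(command -v "$cmd") || { echo "$cmd: not found" >&2; exit 1; }
echo "$path"
One caveat: for shell builtins, command -v prints the builtin's name rather than a filesystem path, so check for a leading / if you strictly need an executable file.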
