Is there any way to use Xidel to query either Bing or Google image search and then extract all the URL links for images from that search? I was interested in doing this via the command line using Xidel.exe. Thanks, K
Sure. Great that you found Xidel; it's a great command-line scraper, but very few people seem to know about it.
Here's a one-liner that scrapes 100 "dogs" image URLs from Google Images:
xidel -s "https://images.google.com" ^
--user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64;) Firefox/40" ^
-f "form(//form,{'q':'dogs'})" ^
-e "<div class='rg_meta'>{extract(.,'ou.:.(.+?).,',1)}</div>*"
BTW, Google actually wants you to use their API, for which you can request an API key, but the above command just pretends to be a browser.
Also, if you add --download at the end, it will download all pics. :-)
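Putting both tips together, the full command would look like this (a sketch; depending on the Xidel version you may need to follow the extracted URLs with -f instead of -e for --download to fetch the images themselves):
xidel -s "https://images.google.com" ^
--user-agent="Mozilla/5.0 (Windows NT 6.1; WOW64;) Firefox/40" ^
-f "form(//form,{'q':'dogs'})" ^
-e "<div class='rg_meta'>{extract(.,'ou.:.(.+?).,',1)}</div>*" ^
--download .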
Over the last weekend, some of my sites logged errors implying incorrect usage of our URLs:
...news.php?lang=EN&id=23'A=0
or
...news.php?lang=EN&id=23'0=A
instead of
...news.php?lang=EN&id=23
Originally I found only one page mentioning this (https://forums.adobe.com/thread/1973913), where they speculated that the additional query string comes from GoogleBot or an encoding error.
I recently changed my sites to use PDO instead of mysql_*. Could this change have caused the errors? Any hints would be useful.
Additionally, all of the requests come from the same user-agent shown below.
Mozilla/5.0 (Windows; U; Windows NT 5.1; pt-PT; rv:1.9.1.2) Gecko/20090729 Firefox/3.5.2 (.NET CLR 3.5.30729)
This led me to the following threads:
pt-BR
and
Strange parameter in URL - what are they trying?
It is a bot testing for SQL injection vulnerabilities by closing a query with an apostrophe and then setting a variable. There are also similar injects that deal with shell commands and/or file path traversals. Whether it's a "good" bot or a bad bot is unknown, but if the injection works, you have bigger issues to deal with. There's a 99% chance your site is not generating these links itself, and there is nothing you can do to stop bots from crafting those URLs, unless you block the requests with a simple regex string or a more complex WAF such as ModSecurity.
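To see why the apostrophe matters, suppose (hypothetically) that the id parameter were interpolated into the SQL unescaped. The probe would then turn the query into something like
SELECT * FROM news WHERE id = '23'A=0'
which is a syntax error; if the primed URL produces a different response than the normal one, the bot learns that the parameter may be injectable.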
Blocking based on user agent is not an effective angle. You need to look at the request heuristics and block based on those instead. Some examples of things to look for in the URL/request/POST/referrer, as both UTF-8 and hex-encoded characters:
double apostrophes
double periods, especially followed by a slash in various encodings
words like "script", "etc" or "passwd"
paths like dev/null used with piping/echoing shell output
%00 null-byte-style characters used to initiate a new command
http in the url more than once (unless your site uses it)
anything regarding cgi (unless your site uses it)
random "enterprise" paths for things like coldfusion, tomcat, etc
If you aren't using a WAF, here is a regex concatenation that should capture many of those within a URL. We use it in PHP apps, so you may need to tweak some escapes depending on where you are using this. Note that this has .cgi, wordpress, and wp-admin along with a bunch of other stuff in the regex; remove them if you need to.
$invalid = "(\(\))"; // let's not look for quotes; [good] bots use them constantly. Looking for () instead, since parentheses technically aren't valid
$period = "(\\002e|%2e|%252e|%c0%2e|\.)";
$slash = "(\\2215|%2f|%252f|%5c|%255c|%c0%2f|%c0%af|\/|\\\)"; // http://security.stackexchange.com/questions/48879/why-does-directory-traversal-attack-c0af-work
$routes = "(etc|dev|irj)" . $slash . "(passwds?|group|null|portal)|allow_url_include|auto_prepend_file|route_*=http";
$filetypes = $period . "+(sql|db|sqlite|log|ini|cgi|bak|rc|apk|pkg|deb|rpm|exe|msi|bak|old|cache|lock|autoload|gitignore|ht(access|passwds?)|cpanel_config|history|zip|bz2|tar|(t)?gz)";
$cgis = "cgi(-|_){0,1}(bin(-sdb)?|mod|sys)?";
$phps = "(changelog|version|license|command|xmlrpc|admin-ajax|wsdl|tmp|shell|stats|echo|(my)?sql|sample|modx|load-config|cron|wp-(up|tmp|sitemaps|sitemap(s)?|signup|settings|" . $period . "?config(uration|-sample|bak)?))" . $period . "php";
$doors = "(" . $cgis . $slash . "(common" . $period . "(cgi|php))|manager" . $slash . "html|stssys" . $period . "htm|((mysql|phpmy|db|my)admin|pma|sqlitemanager|sqlite|websql)" . $slash . "|(jmx|web)-console|bitrix|invoker|muieblackcat|w00tw00t|websql|xampp|cfide|wordpress|wp-admin|hnap1|tmunblock|soapcaller|zabbix|elfinder)";
$sqls = "((un)?hex\(|name_const\(|char\(|a=0)";
$nulls = "(%00|%2500)";
$truth = "(.{1,4})=\1"; // catch OR always-true (1=1) clauses via SQL inject - not used at the moment, it's too broad and may capture search=chowder (ch=ch) for example
$regex = "/$invalid|$period{1,2}$slash|$routes|$filetypes|$phps|$doors|$sqls|$nulls/i";
Using it, at least with PHP, is pretty straightforward with preg_match_all(). Here is an example of how you can use it: https://gist.github.com/dhaupin/605b35ca64ca0d061f05c4cf423521ab
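For a minimal sketch of the idea (the gist above is more complete; logging and banning are left out, and $regex is the pattern built above):
// Decode the request URI and test it against the blacklist pattern
$target = rawurldecode($_SERVER['REQUEST_URI']);
if (preg_match($regex, $target)) {
    http_response_code(403); // block; log or feed to a fail2ban filter as needed
    exit;
}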
WARNING: Be careful if you set this to auto-ban (e.g., a fail2ban filter). MS/Bing DumbBots (and others) often muck up URLs by entering things like strange triple dots from following truncated URLs, or by trying to hit a tel: link as a URI. I don't know why. Here is what I mean: a link with the text www.example.com/link-too-long...truncated.html may point to a correct URL, but Bing may try to access it "as it looks" instead of following the href, resulting in a WAF hit due to double dots.
Since this is a very old version of Firefox, I blocked it in my .htaccess file:
RewriteCond %{HTTP_USER_AGENT} Firefox/3\.5\.2 [NC]
RewriteRule .* err404.php [R,L]
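Since blocking on the user agent alone is easy to evade (see the answer above), you could also match the probe itself in the query string; a sketch, where %27 is the URL-encoded apostrophe:
RewriteCond %{QUERY_STRING} (%27|')(A=0|0=A) [NC]
RewriteRule .* err404.php [R,L]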
I am using the Neo4j browser on Ubuntu. I have over 1 million nodes and I want to export them as a CSV file.
When the returned data set is small, as with "match n return n limit 3", there is a big fat "download csv" button I can use. But when it comes to a big result set of over 1000 rows, the shell just says "Resultset too large (over 1000 rows)" and the button doesn't show up.
How can I export CSV files for a large result set?
You can also use my shell extensions to export cypher results to CSV.
See here: https://github.com/jexp/neo4j-shell-tools#cypher-import
Just provide an -o output.csv file to the import-cypher command.
Well, I just used the Linux shell to do the whole job.
neo4j-shell -file query.cql | sed 's/|/;/g' > myfile.csv
In my case, I also had to convert from UTF-8 to ISO-8859-1, so I typed:
neo4j-shell -file query.cql | sed 's/|/;/g' | iconv -f UTF-8 -t ISO-8859-1 -o myfile.csv
PS: sed performs the replacement: 's/|/;/g' means substitute (s) every "|" with ";", even when there is more than one per line (g).
Hope this can help.
Regards
We followed the approach below, using the import-cypher command from the Neo4j shell mentioned here:
https://github.com/jexp/neo4j-shell-tools#cypher-import
It works very well for us, and the data is formatted properly in CSV format.
neo4j-sh (?)$ import-cypher -o test.csv MATCH (m:TDSP) return m.name
I know this is an old post, but maybe this will help someone else. For anyone using the Symfony Framework, you can make a pretty simple controller to export Neo4j Cypher queries as CSV. This uses the GraphAware Neo4j PHP OGM (https://github.com/graphaware/neo4j-php-ogm) to run raw Cypher queries. I guess this can also easily be implemented without Symfony, using only plain PHP.
Just make a form (with Twig if you want):
<form action="{{ path('admin_exportquery') }}" method="get">
Cypher:<br>
<textarea name="query"></textarea><br>
<input type="submit" value="Submit">
</form>
Then set up the "admin_exportquery" route to point to the controller, and add a controller action to handle the export:
public function exportQueryAction(Request $request)
{
    $query = $request->query->get('query');
    $em = $this->get('neo4j.graph_manager')->getClient();
    $response = new StreamedResponse(function() use ($em, $query) {
        $result = $em->getDatabaseDriver()->run($query);
        $handle = fopen('php://output', 'w');
        // UTF-8 BOM so spreadsheet applications detect the encoding
        fputs($handle, "\xEF\xBB\xBF");
        // the first record's keys become the CSV header row
        $header = $result->getRecords()[0]->keys();
        fputcsv($handle, $header);
        foreach ($result->getRecords() as $r) {
            fputcsv($handle, $r->values());
        }
        fclose($handle);
    });
    $response->headers->set('Content-Type', 'application/force-download');
    $response->headers->set('Cache-Control', 'no-store, no-cache');
    $response->headers->set('Content-Disposition', 'attachment; filename="export.csv"');
    return $response;
}
This lets you download a CSV with UTF-8 characters directly from your browser and gives you all the freedom of Cypher.
IMPORTANT: This has no query check whatsoever, and it is a very good idea to set up appropriate security or query checking before using it :)
The cypher-shell (https://neo4j.com/docs/operations-manual/current/tools/cypher-shell/) that is included in the latest versions of Neo4j does this easily:
cat query.cql | cypher-shell > answer.csv
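If authentication is enabled, you can pass credentials on the command line as well (a sketch; the username and password are placeholders):
cypher-shell -u neo4j -p yourpassword < query.cql > answer.csv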
The limit of 1000 is due to the browser's MaxRows setting. You can change it to e.g. 10000 and thus be able to download those 10000 rows in one go via the download/export CSV button described in the original post. On my laptop, the limit for the download button is somewhere between 10000 and 20000.
By setting the MaxRows limit to 50000 or 300000, I have been able to get the data on screen after waiting a while. Manual select, copy, and paste works. However, doing anything other than closing the browser after each query has not been possible, as the browser becomes very slow.
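In recent versions of the Neo4j Browser, the setting can be changed in the settings sidebar or, I believe, with a browser command like the following (a sketch):
:config maxRows: 10000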
There is another option to export the data as CSV: using cURL with the Neo4j HTTP/HTTPS transaction/commit endpoint.
Here is an example of how to do that:
curl -H accept:application/json -H content-type:application/json \
-d '{"statements":[{"statement":"MATCH (p1:PROFILES)-[:RELATION]-(p2) RETURN ... LIMIT 4"}]}' \
http://localhost:7474/db/data/transaction/commit \
| jq -r '(.results[0]) | .columns,.data[].row | @csv'
This command uses jq to convert the results to CSV format, so make sure jq is installed on your machine.
Note: You may need to pass Authorization in the header for authentication.
Just pass -H 'Authorization: Basic XXXXX=' \ (as an extra line in the curl command above) to avoid a 401.
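The Basic token is just the base64 encoding of user:password; for example (placeholder credentials):
echo -n 'neo4j:yourpassword' | base64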
Here is a blog post with a detailed explanation:
https://neo4j.com/blog/export-csv-from-neo4j-curl-cypher-jq/
I am trying to identify YouTube links (generally), and I wonder: what top-level domains is YouTube using?
So far I know about:
.com (youtube.com)
.be (youtu.be)
Are there any others?
PS: For those specifically looking to check YouTube/Vimeo videos, I would recommend looking at how to check the valid Youtube url using jquery ...
At the moment, YouTube videos can be accessed by two kinds of link, either the usual URL generated by the UI itself, as you click from one video to the next:
https://www.youtube.com/watch?v=tRgl-78sDX2
Or through a sharing URL, created within the UI by clicking the "share" button:
https://youtu.be/6DzSAaNQHR8
Regarding the second part of your question, what domains are required for access to YouTube? Unfortunately this is a moving target, as YouTube changes. At the time of writing (Mar 2015), that list is as follows:
*.youtube.com
*.googlevideo.com
*.ytimg.com
In addition to the core domains above, some ancillary domains are also needed to display the ads, etc on most YouTube pages:
apis.google.com
*.googleusercontent.com
*.gstatic.com
The bulk of the traffic itself comes from *.googlevideo.com. Beginning in late 2014, YouTube started progressively enabling SSL on the youtube.com and googlevideo.com domains. You'll need to be aware of this if you're using this information on a filtering or caching device.
Also note this was tested through a browser; native clients (iOS, Android, etc) may operate differently.
I have discovered these YouTube domains so far:
youtube.com // https://www.youtube.com/watch?v=dfGZ8NyGFS8
youtu.be // https://youtu.be/dfGZ8NyGFS8
youtube-nocookie.com // https://www.youtube-nocookie.com/embed/dfGZ8NyGFS8
Also, youtube.com has a mobile version on a subdomain:
m.youtube.com // https://m.youtube.com/watch?v=dfGZ8NyGFS8
I use this list with the PHP function parse_url. You may find this question useful too.
Here are some of the main domains, subdomains, and gTLDs related to YouTube:
1st Level (domains):
.googlevideo.com
.youtu.be
.youtube.com
.youtube.com.br
.youtube.co.nz
.youtube.de
.youtube.es
.youtube.it
.youtube.nl
.youtube-nocookie.com
.youtube.ru
.ytimg.com
2nd Level (subdomains):
.video-stats.l.google.com
.youtube.googleapis.com
.youtubei.googleapis.com
.ytimg.l.google.com
gTLD:
.youtube
gTLD subdomains:
.rewind.youtube
.blog.youtube
Here are all 147 top-level URLs redirecting to youtube.com that I could find (see below for how I got these URLs):
www.youtube.ae https://www.youtube.com/?gl=AE
www.youtube.at https://www.youtube.com/?gl=AT
www.youtube.az https://www.youtube.com/?gl=AZ
www.youtube.ba https://www.youtube.com/?gl=BA
www.youtube.be https://www.youtube.com/?gl=BE
www.youtube.bg https://www.youtube.com/?gl=BG
www.youtube.bh https://www.youtube.com/?gl=BH
www.youtube.bo https://www.youtube.com/?gl=BO
www.youtube.by https://www.youtube.com/?gl=BY
www.youtube.ca https://www.youtube.com/?gl=CA
www.youtube.cat https://www.youtube.com/?gl=ES
www.youtube.ch https://www.youtube.com/?gl=CH
www.youtube.cl https://www.youtube.com/?gl=CL
www.youtube.co https://www.youtube.com/?gl=CO
www.youtube.co.ae https://www.youtube.com/?gl=AE
www.youtube.co.at https://www.youtube.com/?gl=AT
www.youtube.co.cr https://www.youtube.com/?gl=CR
www.youtube.co.hu https://www.youtube.com/?gl=HU
www.youtube.co.id https://www.youtube.com/?gl=ID
www.youtube.co.il https://www.youtube.com/?gl=IL
www.youtube.co.in https://www.youtube.com/?gl=IN
www.youtube.co.jp https://www.youtube.com/?gl=JP
www.youtube.co.ke https://www.youtube.com/?gl=KE
www.youtube.co.kr https://www.youtube.com/?gl=KR
www.youtube.co.ma https://www.youtube.com/?gl=MA
www.youtube.co.nz https://www.youtube.com/?gl=NZ
www.youtube.co.th https://www.youtube.com/?gl=TH
www.youtube.co.tz https://www.youtube.com/?gl=TZ
www.youtube.co.ug https://www.youtube.com/?gl=UG
www.youtube.co.uk https://www.youtube.com/?gl=GB
www.youtube.co.ve https://www.youtube.com/?gl=VE
www.youtube.co.za https://www.youtube.com/?gl=ZA
www.youtube.co.zw https://www.youtube.com/?gl=ZW
www.youtube.com https://www.youtube.com/
www.youtube.com.ar https://www.youtube.com/?gl=AR
www.youtube.com.au https://www.youtube.com/?gl=AU
www.youtube.com.az https://www.youtube.com/?gl=AZ
www.youtube.com.bd https://www.youtube.com/?gl=BD
www.youtube.com.bh https://www.youtube.com/?gl=BH
www.youtube.com.bo https://www.youtube.com/?gl=BO
www.youtube.com.br https://www.youtube.com/?gl=BR
www.youtube.com.by https://www.youtube.com/?gl=BY
www.youtube.com.co https://www.youtube.com/?gl=CO
www.youtube.com.do https://www.youtube.com/?gl=DO
www.youtube.com.ec https://www.youtube.com/?gl=EC
www.youtube.com.ee https://www.youtube.com/?gl=EE
www.youtube.com.eg https://www.youtube.com/?gl=EG
www.youtube.com.es https://www.youtube.com/?gl=ES
www.youtube.com.gh https://www.youtube.com/?gl=GH
www.youtube.com.gr https://www.youtube.com/?gl=GR
www.youtube.com.gt https://www.youtube.com/?gl=GT
www.youtube.com.hk https://www.youtube.com/?gl=HK
www.youtube.com.hn https://www.youtube.com/?gl=HN
www.youtube.com.hr https://www.youtube.com/?gl=HR
www.youtube.com.jm https://www.youtube.com/?gl=JM
www.youtube.com.jo https://www.youtube.com/?gl=JO
www.youtube.com.kw https://www.youtube.com/?gl=KW
www.youtube.com.lb https://www.youtube.com/?gl=LB
www.youtube.com.lv https://www.youtube.com/?gl=LV
www.youtube.com.ly https://www.youtube.com/?gl=LY
www.youtube.com.mk https://www.youtube.com/?gl=MK
www.youtube.com.mt https://www.youtube.com/?gl=MT
www.youtube.com.mx https://www.youtube.com/?gl=MX
www.youtube.com.my https://www.youtube.com/?gl=MY
www.youtube.com.ng https://www.youtube.com/?gl=NG
www.youtube.com.ni https://www.youtube.com/?gl=NI
www.youtube.com.om https://www.youtube.com/?gl=OM
www.youtube.com.pa https://www.youtube.com/?gl=PA
www.youtube.com.pe https://www.youtube.com/?gl=PE
www.youtube.com.ph https://www.youtube.com/?gl=PH
www.youtube.com.pk https://www.youtube.com/?gl=PK
www.youtube.com.pt https://www.youtube.com/?gl=PT
www.youtube.com.py https://www.youtube.com/?gl=PY
www.youtube.com.qa https://www.youtube.com/?gl=QA
www.youtube.com.ro https://www.youtube.com/?gl=RO
www.youtube.com.sa https://www.youtube.com/?gl=SA
www.youtube.com.sg https://www.youtube.com/?gl=SG
www.youtube.com.sv https://www.youtube.com/?gl=SV
www.youtube.com.tn https://www.youtube.com/?gl=TN
www.youtube.com.tr https://www.youtube.com/?gl=TR
www.youtube.com.tw https://www.youtube.com/?gl=TW
www.youtube.com.ua https://www.youtube.com/?gl=UA
www.youtube.com.uy https://www.youtube.com/?gl=UY
www.youtube.com.ve https://www.youtube.com/?gl=VE
www.youtube.cr https://www.youtube.com/?gl=CR
www.youtube.cz https://www.youtube.com/?gl=CZ
www.youtube.de https://www.youtube.com/?gl=DE
www.youtube.dk https://www.youtube.com/?gl=DK
www.youtube.ee https://www.youtube.com/?gl=EE
www.youtube.es https://www.youtube.com/?gl=ES
www.youtube.fi https://www.youtube.com/?gl=FI
www.youtube.fr https://www.youtube.com/?gl=FR
www.youtube.ge https://www.youtube.com/?gl=GE
www.youtube.gr https://www.youtube.com/?gl=GR
www.youtube.gt https://www.youtube.com/?gl=GT
www.youtube.hk https://www.youtube.com/?gl=HK
www.youtube.hr https://www.youtube.com/?gl=HR
www.youtube.hu https://www.youtube.com/?gl=HU
www.youtube.ie https://www.youtube.com/?gl=IE
www.youtube.in https://www.youtube.com/?gl=IN
www.youtube.iq https://www.youtube.com/?gl=IQ
www.youtube.is https://www.youtube.com/?gl=IS
www.youtube.it https://www.youtube.com/?gl=IT
www.youtube.jo https://www.youtube.com/?gl=JO
www.youtube.jp https://www.youtube.com/?gl=JP
www.youtube.kr https://www.youtube.com/?gl=KR
www.youtube.kz https://www.youtube.com/?gl=KZ
www.youtube.lk https://www.youtube.com/?gl=LK
www.youtube.lt https://www.youtube.com/?gl=LT
www.youtube.lu https://www.youtube.com/?gl=LU
www.youtube.lv https://www.youtube.com/?gl=LV
www.youtube.ly https://www.youtube.com/?gl=LY
www.youtube.ma https://www.youtube.com/?gl=MA
www.youtube.me https://www.youtube.com/?gl=ME
www.youtube.mk https://www.youtube.com/?gl=MK
www.youtube.mx https://www.youtube.com/?gl=MX
www.youtube.my https://www.youtube.com/?gl=MY
www.youtube.net.in https://www.youtube.com/
www.youtube.ng https://www.youtube.com/?gl=NG
www.youtube.ni https://www.youtube.com/?gl=NI
www.youtube.nl https://www.youtube.com/?gl=NL
www.youtube.no https://www.youtube.com/?gl=NO
www.youtube.pa https://www.youtube.com/?gl=PA
www.youtube.pe https://www.youtube.com/?gl=PE
www.youtube.ph https://www.youtube.com/?gl=PH
www.youtube.pk https://www.youtube.com/?gl=PK
www.youtube.pl https://www.youtube.com/?gl=PL
www.youtube.pr https://www.youtube.com/?gl=PR
www.youtube.pt https://www.youtube.com/?gl=PT
www.youtube.qa https://www.youtube.com/?gl=QA
www.youtube.ro https://www.youtube.com/?gl=RO
www.youtube.rs https://www.youtube.com/?gl=RS
www.youtube.ru https://www.youtube.com/?gl=RU
www.youtube.sa https://www.youtube.com/?gl=SA
www.youtube.se https://www.youtube.com/?gl=SE
www.youtube.sg https://www.youtube.com/?gl=SG
www.youtube.si https://www.youtube.com/?gl=SI
www.youtube.sk https://www.youtube.com/?gl=SK
www.youtube.sn https://www.youtube.com/?gl=SN
www.youtube.sv https://www.youtube.com/?gl=SV
www.youtube.tn https://www.youtube.com/?gl=TN
www.youtube.tv https://tv.youtube.com/welcome/
www.youtube.ua https://www.youtube.com/?gl=UA
www.youtube.ug https://www.youtube.com/?gl=UG
www.youtube.uy https://www.youtube.com/?gl=UY
www.youtube.vn https://www.youtube.com/?gl=VN
www.youtube.voto https://www.youtube.com/
Here is the first column of the above table (suitable for copying and pasting):
www.youtube.ae
www.youtube.at
www.youtube.az
www.youtube.ba
www.youtube.be
www.youtube.bg
www.youtube.bh
www.youtube.bo
www.youtube.by
www.youtube.ca
www.youtube.cat
www.youtube.ch
www.youtube.cl
www.youtube.co
www.youtube.co.ae
www.youtube.co.at
www.youtube.co.cr
www.youtube.co.hu
www.youtube.co.id
www.youtube.co.il
www.youtube.co.in
www.youtube.co.jp
www.youtube.co.ke
www.youtube.co.kr
www.youtube.co.ma
www.youtube.co.nz
www.youtube.co.th
www.youtube.co.tz
www.youtube.co.ug
www.youtube.co.uk
www.youtube.co.ve
www.youtube.co.za
www.youtube.co.zw
www.youtube.com
www.youtube.com.ar
www.youtube.com.au
www.youtube.com.az
www.youtube.com.bd
www.youtube.com.bh
www.youtube.com.bo
www.youtube.com.br
www.youtube.com.by
www.youtube.com.co
www.youtube.com.do
www.youtube.com.ec
www.youtube.com.ee
www.youtube.com.eg
www.youtube.com.es
www.youtube.com.gh
www.youtube.com.gr
www.youtube.com.gt
www.youtube.com.hk
www.youtube.com.hn
www.youtube.com.hr
www.youtube.com.jm
www.youtube.com.jo
www.youtube.com.kw
www.youtube.com.lb
www.youtube.com.lv
www.youtube.com.ly
www.youtube.com.mk
www.youtube.com.mt
www.youtube.com.mx
www.youtube.com.my
www.youtube.com.ng
www.youtube.com.ni
www.youtube.com.om
www.youtube.com.pa
www.youtube.com.pe
www.youtube.com.ph
www.youtube.com.pk
www.youtube.com.pt
www.youtube.com.py
www.youtube.com.qa
www.youtube.com.ro
www.youtube.com.sa
www.youtube.com.sg
www.youtube.com.sv
www.youtube.com.tn
www.youtube.com.tr
www.youtube.com.tw
www.youtube.com.ua
www.youtube.com.uy
www.youtube.com.ve
www.youtube.cr
www.youtube.cz
www.youtube.de
www.youtube.dk
www.youtube.ee
www.youtube.es
www.youtube.fi
www.youtube.fr
www.youtube.ge
www.youtube.gr
www.youtube.gt
www.youtube.hk
www.youtube.hr
www.youtube.hu
www.youtube.ie
www.youtube.in
www.youtube.iq
www.youtube.is
www.youtube.it
www.youtube.jo
www.youtube.jp
www.youtube.kr
www.youtube.kz
www.youtube.lk
www.youtube.lt
www.youtube.lu
www.youtube.lv
www.youtube.ly
www.youtube.ma
www.youtube.me
www.youtube.mk
www.youtube.mx
www.youtube.my
www.youtube.net.in
www.youtube.ng
www.youtube.ni
www.youtube.nl
www.youtube.no
www.youtube.pa
www.youtube.pe
www.youtube.ph
www.youtube.pk
www.youtube.pl
www.youtube.pr
www.youtube.pt
www.youtube.qa
www.youtube.ro
www.youtube.rs
www.youtube.ru
www.youtube.sa
www.youtube.se
www.youtube.sg
www.youtube.si
www.youtube.sk
www.youtube.sn
www.youtube.sv
www.youtube.tn
www.youtube.tv
www.youtube.ua
www.youtube.ug
www.youtube.uy
www.youtube.vn
www.youtube.voto
There are also the following domains which are not included in the table:
m.youtube.com
m.youtube.<country code>
youtu.be (with and without the trailing dot)
<country code>.youtube.com (e.g. jp.youtube.com)
youtube.com. (with the trailing dot)
youtube.<country code>. (with the trailing dot)
domains like gaming.youtube.com in YouTube's sitemap https://www.youtube.com/sitemaps/sitemap.xml (see linked sub-sitemaps)
For comprehensiveness, I went through Wikipedia's list of ISO 639-1 codes (https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes), took all 676 possible two-letter codes, appended each of them to the URL as https://www.youtube.com/?gl=<language code>, retrieved the URL, and checked whether the source code contained the text "gl":"<language code>". If the language code is not recognized (e.g. ZZZZZ), the source code falls back to the default of "gl":"CA". Additionally, I went through Mozilla's Public Suffix List (https://wiki.mozilla.org/Public_Suffix_List), took all of those, and tried every single one.
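A rough sketch of that enumeration, assuming bash, curl, and grep (the URL pattern and the "gl":"<code>" marker are taken from the description above; rate limiting may skew the results):
#!/bin/bash
# Try every two-letter combination as a gl code and report the ones
# that the YouTube page source echoes back as "gl":"<CODE>".
for a in {A..Z}; do
  for b in {A..Z}; do
    code="$a$b"
    if curl -s "https://www.youtube.com/?gl=$code" | grep -q "\"gl\":\"$code\""; then
      echo "$code recognized"
    fi
  done
done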
Here are the URLs for which I couldn't find a corresponding TLD (which doesn't imply that there should be one), even though the code in the URL also appeared in the HTML source:
https://www.youtube.com/?gl=CY
https://www.youtube.com/?gl=LI
https://www.youtube.com/?gl=NP
https://www.youtube.com/?gl=PG
https://www.youtube.com/?gl=YE
Bash script to get the URL redirects (sometimes this script doesn't follow a redirect; I am not sure if that is because of my flaky internet, a rate limit, or the script itself):
while read line; do echo "$line -> $(curl -Ls -o /dev/null -w '%{url_effective}' "$line")"; done < youtube_urls.txt
There's even .youtube, which is quite cool I think.
See https://lifeinaday.youtube/ for example.
If you are trying to catch YouTube video links, I am using the regexp below. I don't think they use the national domains: for example, if you go to youtube.mx, they redirect you back to .com with Mexican localization. I guess they figured out that they cannot host content on some national domains; for example, if you use the Saudi Arabian domain, you have to accept all local laws of Saudi Arabia, and if somebody posts content that violates those laws, the owner of the website might be in serious legal trouble. So they use .com for everything and keep the content under the jurisdiction and laws of the US.
https?:\/\/ # Either http or https
(?:[\w]+\.)* # Optional subdomains
(?: # Group host alternatives.
youtu\.be/ # Either youtu.be,
| youtube\.com # or youtube.com
| youtube-nocookie\.com # or youtube-nocookie.com
) # End Host Group
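A minimal usage sketch in PHP, with the pattern above flattened onto one line (the ~ delimiters avoid escaping the slashes; the test URL is just an example):
$pattern = '~https?://(?:[\w]+\.)*(?:youtu\.be/|youtube\.com|youtube-nocookie\.com)~i';
if (preg_match($pattern, 'https://www.youtube.com/watch?v=dfGZ8NyGFS8')) {
    echo "YouTube link detected\n";
}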
According to this page, http://data.iana.org/TLD/tlds-alpha-by-domain.txt, there is also a .youtube TLD itself, as in the domain https://rewind.youtube.
This gave me an idea: why not use these domains as channel redirects? E.g., my channel (WMTV) would have the domain https://wmtv.youtube.
Of course, this would be limited to bigger creators, but as time goes on, this might be a great feature from an individual YouTuber's marketing standpoint.
I am trying to download YouTube videos through Wget. The first thing necessary is to capture the URL of the actual video resource. Suppose I want to download this video. Opening up the page in the Firebug console reveals something like this:
The link which I have encircled looks like the link to the resource, for there we see only the video: http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1. However, when I try to download this resource with Wget, a 4 KB file named r-KBncrOggI#version=3&autohide=1 gets stored on my hard drive, and nothing else. What should I do to get the actual video?
And secondly, is there a way to capture different resources for videos of different resolutions, like 360p, 480p, etc.?
Here is one VERY simplified, yet functional version of the youtube-download utility I cited in my other answer:
#!/usr/bin/env perl
use strict;
use warnings;
# CPAN modules we depend on
use JSON::XS;
use LWP::UserAgent;
use URI::Escape;
# Initialize the User Agent
# YouTube servers are weird, so *don't* parse headers!
my $ua = LWP::UserAgent->new(parse_head => 0);
# fetch video page or abort
my $res = $ua->get($ARGV[0]);
die "bad HTTP response" unless $res->is_success;
# scrape video metadata
if ($res->content =~ /\byt\.playerConfig\s*=\s*({.+?});/sx) {
# parse as JSON or abort
my $json = eval { decode_json $1 };
die "bad JSON: $1" if $#;
# inside the JSON 'args' property, there's an encoded
# url_encoded_fmt_stream_map property which points
# to stream URLs and signatures
while ($json->{args}{url_encoded_fmt_stream_map} =~ /\burl=(http.+?)&sig=([0-9A-F\.]+)/gx) {
# decode URL and attach signature
my $url = uri_unescape($1) . "&signature=$2";
print $url, "\n";
}
}
Usage example (it returns several URLs to streams with different encoding/quality):
$ perl youtube.pl http://www.youtube.com/watch?v=r-KBncrOggI | head -n 1
http://r19---sn-bg07sner.c.youtube.com/videoplayback?fexp=923014%2C916623%2C920704%2C912806%2C922403%2C922405%2C929901%2C913605%2C925710%2C929104%2C929110%2C908493%2C920201%2C913302%2C919009%2C911116%2C926403%2C910221%2C901451&ms=au&mv=m&mt=1357996514&cp=U0hUTVBNUF9FUUNONF9IR1RCOk01RjRyaG4wTHdQ&id=afe2819dcace8202&ratebypass=yes&key=yt1&newshard=yes&expire=1358022107&ip=201.52.68.216&ipbits=8&upn=m-kyX9-4Tgc&sparams=cp%2Cid%2Cip%2Cipbits%2Citag%2Cratebypass%2Csource%2Cupn%2Cexpire&itag=44&sver=3&source=youtube,quality=large&signature=A1E7E91DD087067ED59101EF2AE421A3503C7FED.87CBE6AE7FB8D9E2B67FEFA9449D0FA769AEA739
I'm afraid it's not that easy to get the right link for the video resource.
The link you got, http://www.youtube.com/v/r-KBncrOggI?version=3&autohide=1, points to the player rather than the video itself. There is one Perl utility, youtube-download, which is well-maintained and does the trick. This is how to get the HQ version (magic fmt=18) of that video:
stas@Stanislaws-MacBook-Pro:~$ youtube-download -o "{title}.{suffix}" --fmt 18 r-KBncrOggI
--> Working on r-KBncrOggI
Downloading `Sourav Ganguly in Farhan Akhtar's Show - Oye! It's Friday!.mp4`
75161060/75161060 (100.00%)
Download successful!
stas@Stanislaws-MacBook-Pro:~$
There might be better command-line YouTube Downloaders around. But sorry, one doesn't simply download a video using Firebug and wget any more :(
The only way I know to capture that URL manually is by watching the active downloads of the browser.
The largest data chunks are video data, so you can copy their URL:
http://s.youtube.com/s?lact=111116&uga=m30&volume=4.513679238953965&sd=BBE62AA4AHH1357937949850490&rendering=accelerated&fs=0&decoding=software&nsivbblmax=679542.000&hcbt=105.345&sendtmp=1&fmt=35&w=640&vtmp=1&referrer=None&hl=en_US&nsivbblmin=486355.000&nsivbblmean=603805.166&md=1&plid=AATTCZEEeM825vCx&ns=yt&ptk=youtube_none&csipt=watch7&rt=110.904&tsphab=1&nsiabblmax=129097.000&tspne=0&tpmt=110&nsiabblmin=123113.000&tspfdt=436&hbd=30900552&et=110.146&hbt=30.770&st=70.213&cfps=25&cr=BR&h=480&screenw=1440&nsiabblmean=125949.872&cpn=JlqV9j_oE1jzk7Zc&nsivbblc=343&nsiabblc=343&docid=r-KBncrOggI&len=1302.676&screenh=900&abd=1&pixel_ratio=1&bc=26131333&playerw=854&idpj=0&hcbd=25408143&playerh=510&ldpj=0&fexp=920704,919009,922403,916709,912806,929110,928008,920201,901451,909708,913605,925710,916623,929104,913302,910221,911116,914093,922405,929901&scoville=1&el=detailpage&bd=6676317&nsidf=1&vid=Yfg8gnutZoTD4G5SVKCxpsPvirbqG7pvR&bt=40.333&mos=0&vq=auto
However, for a large video, this will only return a part of the stream, unless you figure out the URL query parameter responsible for the stream range to be downloaded and adjust it.
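For illustration only, a hypothetical sketch (historically, stream URLs accepted a byte-range query parameter like &range=, but the parameter name and behavior change over time; the placeholder must be replaced with a real stream URL):
wget -O video.part "<stream URL copied from the browser>&range=0-99999999"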
A bonus: everything changes periodically, as YouTube is constantly evolving. So don't do this manually unless you crave pain.
The URL http://www.fourmilab.ch/cgi-bin/Earth shows a live map of the Earth.
If I issue this URL in my browser (FF), the image shows up just fine. But when I try 'wget' to fetch the same page, I fail!
Here's what I tried first:
wget -p http://www.fourmilab.ch/cgi-bin/Earth
Thinking that probably all the other form fields are required too, I did a 'View Source' on the above page, noted down the various field values, and then issued the following URL:
wget --post-data "opt=-p&lat=7°27'&lon=50°49'&ns=North&ew=East&alt=150889769&img=learth.evif&date=1&imgsize=320&daynight=-d" http://www.fourmilab.ch/cgi-bin/Earth
Still no image!
Can someone please tell me what is going on here...? Are there any 'gotchas' with CGI and/or form-POST based wgets? Where (book or online resource) would such concepts be explained?
If you inspect the page's source code, there's an img link inside that contains the image of Earth. For example:
<img
src="/cgi-bin/Earth?di=570C6ABB1F33F13E95631EFF088262D5E20F2A10190A5A599229"
ismap="ismap" usemap="#zoommap" width="320" height="320" border="0" alt="" />
Without giving the 'di' parameter, you are just asking for the whole web page, with references to this image, not for the image itself.
Edit: The 'di' parameter encodes which "part" of the Earth you want to receive; anyway, try for example:
wget http://www.fourmilab.ch/cgi-bin/Earth?di=F5AEC312B69A58973CCAB756A12BCB7C47A9BE99E3DDC5F63DF746B66C122E4E4B28ADC1EFADCC43752B45ABE2585A62E6FB304ACB6354E2796D9D3CEF7A1044FA32907855BA5C8F
Use GET instead of POST. They're completely different for the CGI program in the background.
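For example, the same fields from the question sent as a GET query string (a sketch; the degree sign is URL-encoded as %C2%B0, and the URL is quoted so the shell doesn't interpret & or '; note that this may still return an HTML page, as other answers here describe):
wget "http://www.fourmilab.ch/cgi-bin/Earth?opt=-p&lat=7%C2%B027'&lon=50%C2%B049'&ns=North&ew=East&alt=150889769&img=learth.evif&date=1&imgsize=320&daynight=-d"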
Following on from Ravadre,
wget -p http://www.fourmilab.ch/cgi-bin/Earth
downloads an XHTML file which contains an <img> tag.
I edited the XHTML to remove everything but the img tag and turned it into a bash script containing another wget -p command, escaping the ? and =.
When I executed this I got a 14 kB file, which I renamed earth.jpg.
Not really programmatic, the way I did it, but I think it could be done.
But as @somedeveloper said, the di value is changing (since it depends on time).
Guys, here's what I finally did. I'm not fully happy with this solution, as I was (and still am) hoping for a better way... one that gets the image on the first wget itself, giving me the same user experience I get when browsing via Firefox.
#!/bin/bash
tmpf=/tmp/delme.jpeg
base=http://www.fourmilab.ch
# Scrape the dynamic /cgi-bin/Earth?di=... image path out of the page's img tag
liveurl=$(wget -O - "$base/cgi-bin/Earth?opt=-p" 2>/dev/null | perl -0777 -nle 'if(m#<img \s+ src \s* = \s* "(/cgi-bin/Earth\?di= .*? )" #gsix) { print "$1\n" }')
# The second request fetches the actual image ($liveurl already starts with a slash)
wget -O "$tmpf" "${base}${liveurl}" &>/dev/null
What you are downloading is the whole HTML page and not the image. To download the image and other elements too, you'll need to use the --page-requisites (and possibly --convert-links) parameter(s). However, because robots.txt disallows access to URLs under /cgi-bin/, wget will not download the image, which is located under /cgi-bin/. You can tell wget to ignore the robots protocol with the -e robots=off option.
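For example (a sketch; -e robots=off makes wget ignore robots.txt when fetching page requisites):
wget -e robots=off --page-requisites --convert-links http://www.fourmilab.ch/cgi-bin/Earth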