Is there a function in Tcl/Tk to list all available URLs from a link? I want to start programming a web crawler with some features.
For example:
the user types this:
"www.testsite.com"
and gets this:
"www.testsite.com/dir1/"
"www.testsite.com/dir2/"
etc.
Or is it better to program it in another language like Python?
It's pretty easy to do with the http and tDOM packages. You just need to know a bit of XPath…
package require http
package require tdom
set tok [http::geturl http://example.com/index.html]
set html [http::data $tok]
http::cleanup $tok
set doc [dom parse -html $html]
foreach anchor [$doc selectNodes "//a"] {
    # @href is tDOM shorthand for reading the node's href attribute
    puts [$anchor @href]
}
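And since you asked about Python: the same idea works there too, with just the standard library. Here is a minimal sketch, assuming a placeholder URL and using html.parser in place of tDOM:

from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    # Collects the href attribute of every <a> tag encountered.
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL.
                    self.links.append(urljoin(self.base, value))

url = "http://example.com/index.html"
html = urlopen(url).read().decode("utf-8", errors="replace")
collector = LinkCollector(url)
collector.feed(html)
for link in collector.links:
    print(link)

Either way, there is no built-in "show all URLs" function in Tcl/Tk or Python; in both languages you fetch the page and extract the anchors yourself.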
In LibreOffice, it is possible to run Python scripts like this:
sURL = "vnd.sun.star.script:file.function?language=Python&location=document"
oScript = scriptProv.getScript(sURL)
x = oScript.Invoke(args, Array(), Array())
In that example 'file' is a filename, and 'function' is the name of a function in that file.
Is it possible to embed script in that URL? sURL="vnd.." & scriptblock & "?language.."
(It seems like the kind of thing that might be possible with the correct URL, or might not be possible if just not supported).
We can use Python's eval() function. Here is an example inspired by JohnSUN's explanation in the discussion. Note: xray() uses XrayTool to show output, but you could replace that line with any output method of your choosing, such as writing to a file.
def runArbitraryCode(*args):
    # The clicked hyperlink's URL arrives as the first argument.
    url = args[0]
    # Treat everything after "&codeToRun=" as a Python expression.
    codeString = url.split("&codeToRun=")[1]
    x = eval(codeString)
    xray(x)
Now enter this formula in Calc and Ctrl+click on it.
=HYPERLINK("vnd.sun.star.script:misc_examples.py$runArbitraryCode?language=Python&location=user&codeToRun=5+1")
Result: 6
Obligatory caveat: Running eval() on an unknown string is about the worst idea imaginable in terms of security. So hopefully you're the one controlling the URL and not some black hat hacker!
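If all you need to pass through the URL are literal values (numbers, strings, lists and the like), a safer variant of the handler is ast.literal_eval, which refuses to evaluate anything but plain Python literals. A minimal sketch, with a made-up name runLiteralOnly (note the 5+1 example would no longer work, since an arithmetic expression is not a literal):

import ast

def runLiteralOnly(*args):
    url = args[0]
    codeString = url.split("&codeToRun=")[1]
    try:
        # literal_eval rejects calls, attribute access, arithmetic, etc.,
        # so a string like "__import__('os')" raises instead of executing.
        x = ast.literal_eval(codeString)
    except (ValueError, SyntaxError):
        x = "not a plain literal"
    xray(x)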
I want to create a printer extension for OCaml using camlp5. My code would look like the example of this tutorial but instead of creating my own extension of the grammar, I would like to use OCaml's grammar to parse a program.
For that, I would like to use the Pcaml module to parse the given string with OCaml's grammar. Unfortunately, each time I try to use it, I get the following error:
Required module 'Pcaml' is unavailable
This is the part of my code where I load and open modules, as well as part of the code that uses Pcaml:
#load "pa_extprint.cmo";;
#load "q_MLast.cmo";;
#load "pa_o.cmo";;
open Pcaml;;
open Pprintf;;
let pa_ocaml = Grammar.Entry.create Pcaml.gram "pcaml_gram";;
I tried multiple commands to run the program, for example:
ocamlc -pp camlp5o -I +camlp5 gramlib.cma <my_file>.ml
What do I need to be able to use Pcaml and Pcaml.gram?
I recommend using ocamlfind to build and link your programs. The only argument against it for a newcomer is that things can get flaky on Windows without WSL. The following compilation command works without error:
ocamlfind c -syntax camlp5o -package camlp5 -linkpkg a.ml
#load "pa_extprint.cmo";;
#load "q_MLast.cmo";;
#load "pa_o.cmo";;
open Pcaml;;
open Pprintf;;
let pa_ocaml : int Grammar.Entry.e = Grammar.Entry.create Pcaml.gram "pcaml_gram";;
FYI, your #load commands can and should be replaced by specifying the right ocamlfind packages on the command line instead.
I'm trying to make a Greasemonkey script that takes me from this:
http://redirector/referal_ID:site#link
to this:
link
In other words, I need to delete the first part of the links that I click on, bypassing the redirector pages (http://redirector/referal_ID:site#) and keeping only what comes after the # character in the link.
Note that the redirector changes frequently, the referal_ID is always different, and site# is the only constant string in all of the links.
I've tried to modify various scripts, but my next-to-nil knowledge of JavaScript foils all my attempts.
EDIT:
An example of what I need to do is to modify this:
http://firstfirst.net/identi_ref?q=Waterfox%2033.0.2%20[Mozilla%20Firefox%20de%2064%20bits]&ref=http://www.identi.li/c#https://shared.com/dhq1l9djj1?s=l
into this:
https://shared.com/dhq1l9djj1?s=l
The site where I want the script to work is http://www.identi.li/
The trickiest part of this is making sure the script does not fire on pages that are not redirectors. To do that, use a regex @include.
After that, it's just a matter of extracting the target site and changing the location. Here's a complete script:
// ==UserScript==
// @name     _Skip redirects
// @include  /site#http/
// @run-at   document-start
// ==/UserScript==
var targetSite = location.href.replace (/^.+?site#(http.+)$/, "$1");
//--- Use assign() for debug or replace() to keep the browser history clean.
location.assign (targetSite);
//location.replace (targetSite);
Note that @run-at document-start is not strictly necessary, but it can shave a fair amount off the response time of a redirect script.
Suppose I want to turn this:
http://en.wikipedia.org/wiki/Anarchy
into this:
en.wikipedia.org
or even better, this:
wikipedia.org
Is this even possible in regex?
Why use a regex when Ruby has a library for it? The URI library:
ruby-1.9.1-p378 > require 'uri'
=> true
ruby-1.9.1-p378 > uri = URI.parse("http://en.wikipedia.org/wiki/Anarchy")
=> #<URI::HTTP:0x000001010a2270 URL:http://en.wikipedia.org/wiki/Anarchy>
ruby-1.9.1-p378 > uri.host
=> "en.wikipedia.org"
ruby-1.9.1-p378 > uri.host.split('.')
=> ["en", "wikipedia", "org"]
Splitting the host is one way to separate the domains, but I'm not aware of a reliable way to get the base domain -- you can't just count, in the event of a URL like "http://somedomain.otherdomain.school.ac.uk" vs "www.google.com".
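For comparison, the same host extraction in Python's standard urllib.parse carries exactly the same caveat; a quick sketch:

from urllib.parse import urlparse

host = urlparse("http://en.wikipedia.org/wiki/Anarchy").hostname
print(host)             # en.wikipedia.org
print(host.split("."))  # ['en', 'wikipedia', 'org']

# Keeping the last two labels happens to give the base domain here...
print(".".join(host.split(".")[-2:]))  # wikipedia.org
# ...but would wrongly return "ac.uk" for somedomain.otherdomain.school.ac.uk.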
/http:\/\/([^\/]*).*/ will produce en.wikipedia.org from the string you provided.
/http:\/\/.{0,3}\.([^\/]*).*/ will produce wikipedia.org.
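If you want to sanity-check those two patterns, here is one way to run them, using Python's re module (the answer above did not specify a language):

import re

url = "http://en.wikipedia.org/wiki/Anarchy"
# Capture everything after http:// up to the first slash.
print(re.sub(r"http://([^/]*).*", r"\1", url))          # en.wikipedia.org
# Additionally skip a short language prefix such as "en".
print(re.sub(r"http://.{0,3}\.([^/]*).*", r"\1", url))  # wikipedia.org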
Yes.
Now I know you haven't asked for how, and you haven't specified a language, but I'll answer anyway... (note, this works for all language subsites, not just en.wikipedia...)
perl:
$url =~ s,http://[a-z]{2}\.(wikipedia\.org)/.*,$1,;
ruby:
url = url.sub(/http:\/\/[a-z]{2}\.(wikipedia\.org)\/.*/, '\1')
php:
$url = preg_replace('|http://[a-z]{2}\.(wikipedia\.org)/.*|', '$1', $url);
Of course, for this particular example, you don't even need a regex, just this will do:
url = 'wikipedia.org'
but I jest...
You probably want to handle any URL and pull out the domain part, and it should also work for domains in different countries, e.g. foo.co.uk.
In which case, I'd use Mark Rushakoff's solution to get the hostname and then a regex to pull out the domain:
domain = host.sub(/^.*\.([^.]+\.[^.]+(\.[a-z]{2})?)$/, '\1')
Hope this helps
Also, if you want to learn more, I have a regex tutorial online: http://tech.bluesmoon.info/2006/04/beginning-regular-expressions.html
Sure, all you would have to do is search on http://(.*)/wiki/Anarchy
In Perl (sorry, I don't know Ruby, but I expect it's similar):
$string_to_search =~ s/http:\/\/(.*?)\/.*/$1/; should give you en.wikipedia.org
To get rid of the en, search on http:\/\/en\.(.*?)\/.* instead; the \. after en keeps you from ending up with .wikipedia.org, so you get wikipedia.org.
That should do it.
Update: In case you're not familiar with regexes, I would recommend picking up a regex book; this one really rocks and I like it: Mastering Regular Expressions (I saw it on half.com the other day for 14.99 used). To clarify what I suggested above: look for the string http://, then capture everything up to the next /; the capture lands in $1 (in Perl, not sure if it's the same in Ruby), and a simple print $1 will print the string.
Can someone recommend a load testing tool which allows you to either:
a. replay an IIS (7) log(s) to simulate a real live site daily run;
b. import a CSV or equivalent list of URLs so we can achieve a similar thing as above but at a URL level;
c. offer a .NET API so I can easily create simple tests from my list of URLs; that is also a good way to go.
I do not really want to record my tests.
I think I can do b. with WAPT, but I would need to create an XML file manually; not too much grief, but I am wondering if any tools cover these scenarios out of the box.
Visual Studio Test Edition is a great load testing solution, though it would require some code to parse the file into a suitable test run.
Our load testing service lets you write a very simple script using JavaScript to pull data out of a CSV file and then fetch those URLs. For example, the following code would pluck 10 random URLs from the CSV file and fetch them as part of a single session:
var c = browserMob.openHttpClient();
var csv = browserMob.getCSV("urls.csv");
browserMob.beginTransaction();
for (var i = 0; i < 10; i++) {
    browserMob.beginStep("Step 1");
    var url = csv.random().get("url");
    c.get(url);
    browserMob.endStep();
}
browserMob.endTransaction();
The CSV file itself needs to be a normal CSV file with the first row containing a header named "url". This script would be run repeatedly for each virtual user participating in a load test.
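For example, a minimal urls.csv (contents hypothetical) might look like this:

url
http://example.com/
http://example.com/products
http://example.com/contact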
We support a so-called "uri format" in our open-source tool, Yandex.Tank. You simply put all your URIs into a file, one URI per line, then specify the headers and the ammo file in your load.ini like this:
[phantom]
address=example.org
rps_schedule=line(1, 1600, 2m)
headers = [Host: mts-maps.yandex.ru]
[Connection: close] [Bloody: yes]
ammo_file = ammo.uri
ammo.uri:
/
/index.html
/1/example.html
/2/example.html