WebHarvest - Scrape data using authentication - webharvest

I am using the WebHarvest tool to scrape data from a few websites. I have gone through the examples, but could not find a way to authenticate with a website and then scrape data from it.
Can anyone share an example configuration that scrapes data behind a login? How do I send the login parameters and then receive the home page content? I would appreciate any help.

I modified one of the Web-Harvest examples (http://web-harvest.sourceforge.net/samples.php?num=4) and it runs fine with login credentials. Here is the updated code to try:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<!-- sends post request with needed login information -->
<http method="post" url="http://www.nytimes.com/auth/login">
<http-param name="is_continue">true</http-param>
<http-param name="URI">http://</http-param>
<http-param name="OQ"></http-param>
<http-param name="OP"></http-param>
<http-param name="USERID">web-harvest</http-param>
<http-param name="PASSWORD">web-harvest</http-param>
</http>
<var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>
<file action="write" path="D:/nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
<template>
<![CDATA[ <newyourk_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
</template>
<loop item="articleUrl" index="i">
<!-- collects URLs of all articles from the front page -->
<list>
<xpath expression="//div[@class='story']">
<html-to-xml>
<http url="${startUrl}"/>
</html-to-xml>
</xpath>
</list>
<!-- downloads each article and extract data from it -->
<body>
<xquery>
<xq-param name="doc">
<var name="articleUrl"/>
</xq-param>
<xq-expression><![CDATA[
declare variable $doc as node() external;
$doc
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </newyourk_times> ]]>
</file>
</config>

Related

How to add a web view to Jira dashboard

We use Jira, like many others, but we also use a forum as our business discussion board and have done so since before Jira existed, so it holds a lot of historical information.
It is possible to add "Gadgets" to the dashboard, but is it possible to add a webview somewhere?
Follow this guide:
https://developer.atlassian.com/jiradev/jira-platform/guides/dashboards/tutorial-writing-gadgets-for-jira
Open src/main/resources/gadget.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<Module>
<ModulePrefs title="__MSG_gadget.title__" directory_title="__MSG_gadget.title__"
description="__MSG_gadget.description__">
<Optional feature="gadget-directory">
<Param name="categories">
JIRA
</Param>
</Optional>
<Optional feature="atlassian.util" />
<Optional feature="auth-refresh" />
<Require feature="views" />
<Require feature="settitle"/>
<Require feature="oauthpopup" />
#oauth
<Locale messages="__ATLASSIAN_BASE_URL__/download/resources/jira-gadget-tutorial-plugin/i18n/ALL_ALL.xml"/>
</ModulePrefs>
<Content type="html" view="profile">
<!-- omitted for now -->
</Content>
</Module>
Did you see:
<Content type="html" view="profile">
<!-- omitted for now -->
</Content>
Just insert your frame here:
<Content type="html" view="profile">
<![CDATA[ <iframe src="your forum url"></iframe> ]]>
</Content>

Webharvest crawler script not creating XML file

I'm hoping someone can point out my (probably stupid) problem with this script. I'm trying to crawl a website to get the posts on the site and to load this into an XML document. I have tried to combine a couple of example scripts - the crawler and nytimes examples.
The script runs without error, however only the <edublogs date="02.10.2015"></edublogs> tags are exported.
Thanks in advance for your help.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
<!-- set initial page -->
<var-def name="home"><<SNIPPED>></var-def>
<!-- define script functions and variables -->
<script><![CDATA[
/* checks if specified URL is valid for download */
boolean isValidUrl(String url) {
String urlSmall = url.toLowerCase();
return urlSmall.startsWith("http://<<SNIPPED>>/") || urlSmall.startsWith("https://<<SNIPPED>>/");
}
/* set of unvisited URLs */
Set unvisited = new HashSet();
unvisited.add(home);
/* pushes to web-harvest context initial set of unvisited pages */
SetContextVar("unvisitedVar", unvisited);
/* set of visited URLs */
Set visited = new HashSet();
]]></script>
<file action="write" path="posts${sys.date()}.xml" charset="UTF-8">
<template>
<![CDATA[ <allposts date="${sys.datetime("dd.MM.yyyy")}"> ]]>
</template>
<!-- loop while there are any unvisited links -->
<while condition="${unvisitedVar.toList().size() != 0}">
<loop item="currUrl">
<list>
<var name="unvisitedVar"/>
</list>
<body>
<empty>
<!-- Get page content -->
<var-def name="content">
<html-to-xml>
<http url="${currUrl}"/>
</html-to-xml>
</var-def>
<!-- Get variables -->
<xquery>
<xq-param name="doc">
<var name="content"/>
</xq-param>
<xq-expression><![CDATA[
declare variable $doc as node() external;
let $title := data($doc//h1)
let $text := data($doc//div[@class="post-entry"])
let $categories := data($doc//div[@class="post-data"])
return
<post>
<title>{data($title)}</title>
<url>$(currUrl)</url>
<text>{data($text)}</text>
<categories>{data($categories)}</categories>
</post>
]]></xq-expression>
</xquery>
<!-- adds current URL to the list of visited -->
<script><![CDATA[
visited.add(sys.fullUrl(home, currUrl));
Set newLinks = new HashSet();
]]></script>
<!-- loop through all collected links on the downloaded page -->
<loop item="currLink">
<list>
<xpath expression="//a/@href">
<var name="content"/>
</xpath>
</list>
<body>
<script><![CDATA[
String fullLink = sys.fullUrl(home, currLink);
fullLink = fullLink.replaceAll("#.*","");
if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) && !fullLink.endsWith(".png") ) {
newLinks.add(fullLink);
}
]]></script>
</body>
</loop>
</empty>
</body>
</loop>
<!-- unvisited link are now all the collected new links from downloaded pages -->
<script><![CDATA[
SetContextVar("unvisitedVar", newLinks);
]]></script>
</while>
<![CDATA[ </allposts> ]]>
</file>
</config>
It's because your while doesn't return anything, most likely because you've surrounded the whole body with empty, which forces no results to be returned (see the manual). The body still sets variables and so on, but nothing is passed up for file to write.
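A minimal sketch of that restructuring, assuming the goal is to keep the crawling side effects silent while letting the xquery result reach file: wrap only the variable definition and the link collection in empty, and leave xquery outside it. The commented placeholders stand for the corresponding parts of the config in the question, and this has not been run against the site.

```xml
<loop item="currUrl">
    <list>
        <var name="unvisitedVar"/>
    </list>
    <body>
        <!-- side effect only: define the variable, return nothing -->
        <empty>
            <var-def name="content">
                <html-to-xml>
                    <http url="${currUrl}"/>
                </html-to-xml>
            </var-def>
        </empty>
        <!-- not wrapped in empty, so its result can flow up to file -->
        <xquery>
            <!-- xq-param and xq-expression as in the question -->
        </xquery>
        <!-- side effects only: bookkeeping and link collection -->
        <empty>
            <!-- the visited.add(...) script and the inner currLink loop from the question -->
        </empty>
    </body>
</loop>
```

With this layout each iteration contributes one post element to the output instead of an empty value.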

Trying to grab information in Child Link using WebHarvest

I would like to grab the information from each child link, but the program throws an error. Below is my full config file. The error is: Caused by: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 724; Element type "t.length" must be followed by either attribute specifications, ">" or "/>".
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="webpage">
<html-to-xml>
<http url="http://www.thestar.com.my/business/" />
</html-to-xml>
</var-def>
<loop item="TheStarBiz" index="i">
<list>
<xpath expression="//div[@class='nine columns mobile3']">
<var name="webpage"></var>
</xpath>
</list>
<body>
<var-def name="title">
<xpath expression="(//p[@class='m'])/a/text()">
<var name="TheStarBiz"></var>
</xpath>
</var-def>
<var-def name="link">
<xpath expression="//p[@class='m']/a/@href">
<var name="TheStarBiz"></var>
</xpath>
</var-def>
<var-def name="new_url">
<xquery>
<xq-param name="TheStarBiz"><var name="TheStarBiz"/></xq-param>
<xq-expression><![CDATA[
declare variable $TheStarBiz as node() external;
let $url := data($TheStarBiz//p[@class='m']/a/@href)
return
$url
]]></xq-expression>
</xquery>
</var-def>
<var-def name="new_page_content">
<http url="${new_url}"/>
</var-def>
<var-def name="fulldesc">
<xpath expression="//div[@class='story']">
<var name="new_page_content"/>
</xpath>
</var-def>
<var-def name="textfile">
<file action="append" type="text" path="C:\Users\jacey\Desktop\WebHarvest\test.txt">
<template>
${title} ${sys.cr}${sys.lf}
${link} ${sys.cr}${sys.lf}
${new_page_content} ${sys.cr}${sys.lf}
</template>
</file>
</var-def>
</body>
</loop>
</config>
For those who come after:
I had almost the same error and it was caused by a snippet of javascript in the file being parsed:
blah...for(var o=0;o<t.length;o++)...blah
In hindsight I suppose it's kind of obvious: the "<" in the loop condition is read as the start of an element named t.length. In our case, this happened because the endpoint was no longer returning XML but HTML. If the desired file actually contains JavaScript, you might add CDATA tags around your JS like so:
<script>
/* <![CDATA[ */
console.log(myJavaScriptCode < theBest);
/* ]]> */
</script>
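For the config in the question, one likely trigger is that new_page_content holds the raw HTML returned by http, which xpath then tries to parse as XML. A sketch of a possible fix, assuming that is the cause: clean the child page with html-to-xml first, the same way the question already does for webpage.

```xml
<!-- clean the downloaded HTML before running XPath on it -->
<var-def name="new_page_content">
    <html-to-xml>
        <http url="${new_url}"/>
    </html-to-xml>
</var-def>
<var-def name="fulldesc">
    <xpath expression="//div[@class='story']">
        <var name="new_page_content"/>
    </xpath>
</var-def>
```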

cvc-elt.1: Cannot find the declaration of element 'oauth-config'. [2]

I have started implementing JOAuth authentication. Of course, right now I am copying and pasting to learn how it works.
Currently I am facing this issue:
"cvc-elt.1: Cannot find the declaration of element 'oauth-config'. [2]"
I used the following question as a reference:
JOAuth, a java-based OAuth 1 (final) and OAuth 2 (draft 10) library. How do I use it?
oauth-config.xml code snippet
<?xml version="1.0" encoding="UTF-8" ?>
<oauth-config>
<!-- Twitter OAuth Config -->
<oauth name="twitter" version="1">
<consumer key="TWITTER_KEY" secret="TWITTER_SECRET" />
<provider requestTokenUrl="https://api.twitter.com/oauth/request_token" authorizationUrl="https://api.twitter.com/oauth/authorize" accessTokenUrl="https://api.twitter.com/oauth/access_token" />
</oauth>
<!-- Facebook OAuth -->
<oauth name="facebook" version="2">
<consumer key="APP_ID" secret="APP_SECRET" />
<provider authorizationUrl="https://graph.facebook.com/oauth/authorize" accessTokenUrl="https://graph.facebook.com/oauth/access_token" />
</oauth>
<service path="/request_token_ready" class="com.neurologic.music4point0.oauth.TwitterOAuthService" oauth="twitter">
<success path="/start.htm" />
</service>
<service path="/oauth_redirect" class="com.neurologic.music4point0.oauth.FacebookOAuthService" oauth="facebook">
<success path="/start.htm" />
</service>
</oauth-config>
Can you help me find what is wrong here? I think we need to add a DTD file. Please let me know if you need any additional information.
Much appreciated,
Pradeep

voicexml output of the external grammar and refill the field element

If the user says "help", I would like the following field not to be filled, and the user to hear all possible options instead.
<form id="test">
<field name="var1">
<prompt bargein="true" bargeintype="hotword" >say xy </prompt>
<grammar src = "grammar.grxml" type="application/srgs+xml" />
<filled>
<assign name="myProdukt" expr="var1" />
you said <value expr="myProdukt"/>
</filled>
</field>
(Let's say the external grammar contains "p1", "p2" and "p3". The user says "help", the system says "p1", "p2", "p3", and the user can choose again. So the word "help" has to be in the external grammar as well, doesn't it?)
Thanks in advance.
Yes, the active grammar must contain a "help" utterance which returns the value 'help'. You then catch the event with a help tag:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd">
<form id="test">
<field name="var1">
<prompt bargein="true" bargeintype="hotword" >say xy </prompt>
<grammar src = "grammar.grxml" type="application/srgs+xml" />
<filled>
<assign name="myProdukt" expr="var1" />
you said <value expr="myProdukt"/>
</filled>
<help>
To choose a product, say,
<!-- whatever the product choices are -->
frobinator, submarine, curling iron, ..
<reprompt/>
</help>
</field>
</form>
</vxml>
Alternatively, following the DRY principle, this can be done globally for your application by using an application root document containing a link element. In the example app-root.vxml document below, a link binds a global grammar with a "help" utterance to the help event:
<?xml version="1.0"?>
<vxml version="2.1" xmlns="http://www.w3.org/2001/vxml">
<link event="help">
<grammar mode="voice" root="root_rule" tag-format="semantics/1.0"
type="application/srgs+xml" version="1.0" xml:lang="en-US">
<rule id="root_rule" scope="public">
<one-of>
<item weight="1.0">
help
</item>
</one-of>
</rule>
</grammar>
</link>
</vxml>
This grammar will be active everywhere, effectively merged with each active field grammar. For more information about application root documents, see the section "Executing a Multi-Document Application" of the VoiceXML specification. Also see "Handling Events" in the Tellme Studio documentation.
Then, in pages of your application, make reference to the application root document via the application attribute of the vxml element and speak appropriately in a help catch block:
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.w3.org/2001/vxml http://www.w3.org/TR/voicexml20/vxml.xsd"
application="app-root.vxml">
<form id="test">
<field name="var1">
<prompt bargein="true" bargeintype="hotword" >say xy </prompt>
<grammar src = "grammar.grxml" type="application/srgs+xml" />
<filled>
<assign name="myProdukt" expr="var1" />
you said <value expr="myProdukt"/>
</filled>
<help>
To choose a product, say,
<!-- whatever the product choices are -->
frobinator, submarine, curling iron, ..
<reprompt/>
</help>
</field>
</form>
</vxml>
You could, of course, put the link code in the same page as your form, but it is likely you will want help active for every field of your application unless there is collision with something in a particular field's grammar.
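A sketch of that single-document variant, assuming the same one-word help grammar as in app-root.vxml and the form from the question. The link sits at document scope, so its grammar is active only for the fields in this file. The var declaration for myProdukt and the p1, p2, p3 wording in the help text are additions for illustration, and the whole thing is untested on any particular VoiceXML platform.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<vxml xmlns="http://www.w3.org/2001/vxml" version="2.1">
    <!-- document-scoped link: saying "help" throws the help event -->
    <link event="help">
        <grammar mode="voice" root="root_rule" tag-format="semantics/1.0"
                 type="application/srgs+xml" version="1.0" xml:lang="en-US">
            <rule id="root_rule" scope="public">
                <one-of>
                    <item>help</item>
                </one-of>
            </rule>
        </grammar>
    </link>
    <form id="test">
        <!-- assumption: declared here so the assign below has a target -->
        <var name="myProdukt"/>
        <field name="var1">
            <prompt bargein="true" bargeintype="hotword">say xy</prompt>
            <grammar src="grammar.grxml" type="application/srgs+xml"/>
            <filled>
                <assign name="myProdukt" expr="var1"/>
                you said <value expr="myProdukt"/>
            </filled>
            <!-- handles the help event thrown by the link, then asks again -->
            <help>
                To choose a product, say p1, p2, or p3.
                <reprompt/>
            </help>
        </field>
    </form>
</vxml>
```

Because the help event is caught inside the field, var1 stays unfilled and the form interpretation algorithm visits the field again after the help prompt.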
