Trying to grab information from child links using WebHarvest

I would like to grab the information from each child link, but the program throws an error. Below is my full config file. The error is: Caused by: org.xml.sax.SAXParseException; lineNumber: 6; columnNumber: 724; Element type "t.length" must be followed by either attribute specifications, ">" or "/>".
<?xml version="1.0" encoding="UTF-8"?>
<config>
<var-def name="webpage">
<html-to-xml>
<http url="http://www.thestar.com.my/business/" />
</html-to-xml>
</var-def>
<loop item="TheStarBiz" index="i">
<list>
<xpath expression="//div[@class='nine columns mobile3']">
<var name="webpage"></var>
</xpath>
</list>
<body>
<var-def name="title">
<xpath expression="(//p[@class='m'])/a/text()">
<var name="TheStarBiz"></var>
</xpath>
</var-def>
<var-def name="link">
<xpath expression="//p[@class='m']/a/@href">
<var name="TheStarBiz"></var>
</xpath>
</var-def>
<var-def name="new_url">
<xquery>
<xq-param name="TheStarBiz"><var name="TheStarBiz"/></xq-param>
<xq-expression><![CDATA[
declare variable $TheStarBiz as node() external;
let $url := data($TheStarBiz//p[@class='m']/a/@href)
return
$url
]]></xq-expression>
</xquery>
</var-def>
<var-def name="new_page_content">
<http url="${new_url}"/>
</var-def>
<var-def name="fulldesc">
<xpath expression="//div[@class='story']">
<var name="new_page_content"/>
</xpath>
</var-def>
<var-def name="textfile">
<file action="append" type="text" path="C:\Users\jacey\Desktop\WebHarvest\test.txt">
<template>
${title} ${sys.cr}${sys.lf}
${link} ${sys.cr}${sys.lf}
${new_page_content} ${sys.cr}${sys.lf}
</template>
</file>
</var-def>
</body>
</loop>
</config>

For those who come after:
I had almost the same error and it was caused by a snippet of javascript in the file being parsed:
blah...for(var o=0;o<t.length;o++)...blah
In hindsight I suppose it's kind of obvious. In our case, it happened because the endpoint was no longer returning XML but HTML. If the file you want really does contain JavaScript, you can wrap the script body in CDATA tags like so:
<script>
/* <![CDATA[ */
console.log(myJavaScriptCode < theBest);
/* ]]> */
</script>
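To see why the bare `<` trips the parser, here is a minimal sketch in Python (swapped in for illustration; WebHarvest itself uses a Java SAX parser, and the page fragments below are made up). It may also be worth noting that in the question's config `new_page_content` is fetched without `<html-to-xml>`, so the later `<xpath>` is probably being handed raw HTML, which would fit this error.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment: inline JavaScript containing a bare "<".
RAW = "<html><script>for(var o=0;o<t.length;o++){}</script></html>"

# Same script, with the body wrapped in a CDATA section.
WRAPPED = (
    "<html><script>/* <![CDATA[ */"
    "for(var o=0;o<t.length;o++){}"
    "/* ]]> */</script></html>"
)


def is_well_formed(document):
    """Return True if the string parses as XML, False otherwise."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


def script_text(document):
    """Return the text content of the first <script> element."""
    return ET.fromstring(document).find("script").text


if __name__ == "__main__":
    print(is_well_formed(RAW))       # parser reads "<t.length" as a tag
    print(is_well_formed(WRAPPED))   # CDATA content is not parsed as markup
    print(script_text(WRAPPED))
```

The parser treats `<t.length` as the start of an element named `t.length`, which is exactly what the SAX message complains about. Inside CDATA the same characters are plain text.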

Related

How to add a web view to Jira dashboard

We use Jira, like many others, but we have also used a forum as our business discussion board since before Jira existed, so it holds a lot of historical information.
It is possible to add "Gadgets" to the dashboard, but is it possible to add a web view somewhere?
Follow this guide:
https://developer.atlassian.com/jiradev/jira-platform/guides/dashboards/tutorial-writing-gadgets-for-jira
Open src/main/resources/gadget.xml:
<?xml version="1.0" encoding="UTF-8" ?>
<Module>
<ModulePrefs title="__MSG_gadget.title__" directory_title="__MSG_gadget.title__"
description="__MSG_gadget.description__">
<Optional feature="gadget-directory">
<Param name="categories">
JIRA
</Param>
</Optional>
<Optional feature="atlassian.util" />
<Optional feature="auth-refresh" />
<Require feature="views" />
<Require feature="settitle"/>
<Require feature="oauthpopup" />
#oauth
<Locale messages="__ATLASSIAN_BASE_URL__/download/resources/jira-gadget-tutorial-plugin/i18n/ALL_ALL.xml"/>
</ModulePrefs>
<Content type="html" view="profile">
<!-- omitted for now -->
</Content>
</Module>
Did you see:
<Content type="html" view="profile">
<!-- omitted for now -->
</Content>
Just insert your frame here:
<Content type="html" view="profile">
<iframe src="your forum url"></iframe>
</Content>

Webharvest crawler script not creating XML file

I'm hoping someone can point out my (probably stupid) problem with this script. I'm trying to crawl a website to get the posts on the site and to load this into an XML document. I have tried to combine a couple of example scripts - the crawler and nytimes examples.
The script runs without error, however only the <edublogs date="02.10.2015"></edublogs> tags are exported.
Thanks in advance for your help.
<?xml version="1.0" encoding="UTF-8"?>
<config charset="UTF-8">
<!-- set initial page -->
<var-def name="home"><<SNIPPED>></var-def>
<!-- define script functions and variables -->
<script><![CDATA[
/* checks if specified URL is valid for download */
boolean isValidUrl(String url) {
String urlSmall = url.toLowerCase();
return urlSmall.startsWith("http://<<SNIPPED>>/") || urlSmall.startsWith("https://<<SNIPPED>>/");
}
/* set of unvisited URLs */
Set unvisited = new HashSet();
unvisited.add(home);
/* pushes to web-harvest context initial set of unvisited pages */
SetContextVar("unvisitedVar", unvisited);
/* set of visited URLs */
Set visited = new HashSet();
]]></script>
<file action="write" path="posts${sys.date()}.xml" charset="UTF-8">
<template>
<![CDATA[ <allposts date="${sys.datetime("dd.MM.yyyy")}"> ]]>
</template>
<!-- loop while there are any unvisited links -->
<while condition="${unvisitedVar.toList().size() != 0}">
<loop item="currUrl">
<list>
<var name="unvisitedVar"/>
</list>
<body>
<empty>
<!-- Get page content -->
<var-def name="content">
<html-to-xml>
<http url="${currUrl}"/>
</html-to-xml>
</var-def>
<!-- Get variables -->
<xquery>
<xq-param name="doc">
<var name="content"/>
</xq-param>
<xq-expression><![CDATA[
declare variable $doc as node() external;
let $title := data($doc//h1)
let $text := data($doc//div[@class="post-entry"])
let $categories := data($doc//div[@class="post-data"])
return
<post>
<title>{data($title)}</title>
<url>$(currUrl)</url>
<text>{data($text)}</text>
<categories>{data($categories)}</categories>
</post>
]]></xq-expression>
</xquery>
<!-- adds current URL to the list of visited -->
<script><![CDATA[
visited.add(sys.fullUrl(home, currUrl));
Set newLinks = new HashSet();
]]></script>
<!-- loop through all collected links on the downloaded page -->
<loop item="currLink">
<list>
<xpath expression="//a/@href">
<var name="content"/>
</xpath>
</list>
<body>
<script><![CDATA[
String fullLink = sys.fullUrl(home, currLink);
fullLink = fullLink.replaceAll("#.*","");
if ( isValidUrl(fullLink.toString()) && !visited.contains(fullLink) && !unvisitedVar.toList().contains(fullLink) && !fullLink.endsWith(".png") ) {
newLinks.add(fullLink);
}
]]></script>
</body>
</loop>
</empty>
</body>
</loop>
<!-- unvisited link are now all the collected new links from downloaded pages -->
<script><![CDATA[
SetContextVar("unvisitedVar", newLinks);
]]></script>
</while>
<![CDATA[ </posts> ]]>
</file>
It's because your while doesn't RETURN anything, most likely because you've wrapped the loop body in empty, which forces no result to be returned (see the manual). The body still sets variables and so on, but it returns nothing for file to write.

Jira gadget is not working

I am trying to develop a gadget that will ultimately incorporate ChartJS, but I am having issues with the default gadget, as it does not load.
The code I am putting into the atlassian-plugin.xml is the following:
<atlassian-plugin key="${project.groupId}.${project.artifactId}" name="${project.name}" plugins-version="2">
<plugin-info>
<description>${project.description}</description>
<version>${project.version}</version>
<vendor name="${project.organization.name}" url="${project.organization.url}" />
<param name="plugin-icon">images/pluginIcon.png</param>
<param name="plugin-logo">images/pluginLogo.png</param>
</plugin-info>
<!-- add our i18n resource -->
<resource type="i18n" name="i18n" location="report"/>
<!-- add our web resources -->
<web-resource key="report-resources" name="report Web Resources">
<dependency>com.atlassian.auiplugin:ajs</dependency>
<resource type="download" name="report.css" location="/css/report.css"/>
<resource type="download" name="report.js" location="/js/report.js"/>
<resource type="download" name="images/" location="/images"/>
<context>report</context>
</web-resource>
<!-- publish our component -->
<component key="myPluginComponent" class="com.wfs.report.MyPluginComponentImpl" public="true">
<interface>com.wfs.report.MyPluginComponent</interface>
</component>
<!-- import from the product container -->
<component-import key="applicationProperties" interface="com.atlassian.sal.api.ApplicationProperties" />
<webwork1 key="demoaction" name="JTricks Demo Action" class="java.lang.Object">
<actions>
<action name="com.wfs.report.DemoAction" alias="DemoAction">
<view name="input">/templates/input.vm</view>
<view name="success">/templates/joy.vm</view>
<view name="error">/templates/tears.vm</view>
</action>
</actions>
</webwork1>
<atlassian-plugin name="Hello World" key="example.plugin.helloworld" plugins-version="2">
<plugin-info>
<description>A basic gadget module</description>
<vendor name="Atlassian Software Systems" url="http://www.atlassian.com"/>
<version>1.0</version>
</plugin-info>
<gadget key="unique-gadget-key" location="gadget.xml"/>
</atlassian-plugin>
</atlassian-plugin>
and my gadget.xml, which I put in the resources directory, is:
<?xml version="1.0" encoding="UTF-8" ?>
<Module>
<ModulePrefs title="JIRA Issues" author_email="adent@example.com" directory_title="JIRA Issues"
screenshot="images/screenshot.png"
thumbnail="images/thumbnail.png">
<Optional feature="dynamic-height" />
</ModulePrefs>
<Content type="html">
<![CDATA[
Hello, world!
]]>
</Content>
</Module>
</xml>
which I copied from https://developer.atlassian.com/display/GADGETS/Creating+your+Gadget+XML+Specification
yet I still get
It looks like you have one plugin descriptor nested inside another plugin descriptor. (I'm surprised that it actually passed validation!)
Change this:
<atlassian-plugin name="Hello World" key="example.plugin.helloworld" plugins-version="2">
<plugin-info>
<description>A basic gadget module</description>
<vendor name="Atlassian Software Systems" url="http://www.atlassian.com"/>
<version>1.0</version>
</plugin-info>
<gadget key="unique-gadget-key" location="gadget.xml"/>
</atlassian-plugin>
to just this:
<gadget key="unique-gadget-key" location="gadget.xml"/>
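If you want to check a descriptor for this kind of accidental nesting before deploying, something like the following could work. It is only a sketch in Python using the standard library, not an Atlassian tool, and the two descriptors are trimmed-down, hypothetical versions of the one in the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened descriptor that repeats the mistake from the question.
NESTED = """
<atlassian-plugin key="outer" name="outer" plugins-version="2">
  <plugin-info><version>1.0</version></plugin-info>
  <atlassian-plugin key="inner" name="Hello World" plugins-version="2">
    <gadget key="unique-gadget-key" location="gadget.xml"/>
  </atlassian-plugin>
</atlassian-plugin>
"""

# The same descriptor with the gadget module moved to the top level.
FIXED = """
<atlassian-plugin key="outer" name="outer" plugins-version="2">
  <plugin-info><version>1.0</version></plugin-info>
  <gadget key="unique-gadget-key" location="gadget.xml"/>
</atlassian-plugin>
"""


def nested_descriptors(descriptor):
    """Count <atlassian-plugin> elements found below the root element."""
    root = ET.fromstring(descriptor.strip())
    total = sum(1 for _ in root.iter("atlassian-plugin"))
    # iter() also yields the root itself, so leave it out of the count.
    return total - (1 if root.tag == "atlassian-plugin" else 0)


def top_level_gadgets(descriptor):
    """Return the keys of <gadget> modules that are direct children of the root."""
    root = ET.fromstring(descriptor.strip())
    return [gadget.get("key") for gadget in root.findall("gadget")]


if __name__ == "__main__":
    for name, text in (("nested", NESTED), ("fixed", FIXED)):
        print(name, nested_descriptors(text), top_level_gadgets(text))
```

In the nested version the gadget is not a direct child of the outer descriptor, which is consistent with the gadget never showing up.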

WebHarvest - Scrape data using authentication

I am using the WebHarvest tool to scrape web data from a few websites. I have gone through the examples, but was not able to find a way to authenticate in websites and then scrape data from them.
Can anyone please cite an example configuration to achieve web data scraping through authentication? How do I send the login parameters and then receive the home page content? Appreciate your help on this.
I modified one of the Web Harvest examples (http://web-harvest.sourceforge.net/samples.php?num=4) and it runs fine with login credentials. You can take the updated code below and try it:
<?xml version="1.0" encoding="UTF-8"?>
<config charset="ISO-8859-1">
<!-- sends post request with needed login information -->
<http method="post" url="http://www.nytimes.com/auth/login">
<http-param name="is_continue">true</http-param>
<http-param name="URI">http://</http-param>
<http-param name="OQ"></http-param>
<http-param name="OP"></http-param>
<http-param name="USERID">web-harvest</http-param>
<http-param name="PASSWORD">web-harvest</http-param>
</http>
<var-def name="startUrl">http://www.nytimes.com/pages/todayspaper/index.html</var-def>
<file action="write" path="D:/nytimes/nytimes${sys.date()}.xml" charset="UTF-8">
<template>
<![CDATA[ <newyourk_times date="${sys.datetime("dd.MM.yyyy")}"> ]]>
</template>
<loop item="articleUrl" index="i">
<!-- collects URLs of all articles from the front page -->
<list>
<xpath expression="//div[@class='story']">
<html-to-xml>
<http url="${startUrl}"/>
</html-to-xml>
</xpath>
</list>
<!-- downloads each article and extract data from it -->
<body>
<xquery>
<xq-param name="doc">
<var name="articleUrl"/>
</xq-param>
<xq-expression><![CDATA[
declare variable $doc as node() external;
$doc
]]></xq-expression>
</xquery>
</body>
</loop>
<![CDATA[ </newyourk_times> ]]>
</file>
</config>
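The idea behind the config above is: POST the login form first, keep whatever session cookies come back, then fetch the protected pages with the same client. For comparison, here is the same pattern sketched with Python's standard library instead of WebHarvest. The field names are copied from the config; the login URL used in the demo is a placeholder, and nothing is actually sent.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def build_session():
    """Create an opener that stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def build_login_request(login_url, fields):
    """Build (but do not send) a form-encoded POST request for the login page."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(login_url, data=body, method="POST")


if __name__ == "__main__":
    opener, jar = build_session()
    # Placeholder URL and credentials, for illustration only.
    request = build_login_request(
        "http://example.invalid/auth/login",
        {"USERID": "web-harvest", "PASSWORD": "web-harvest"},
    )
    print(request.get_method(), request.data)
    # A real run would call opener.open(request) and then reuse the same
    # opener for the pages that require the session cookie.
```

Which form fields a site expects is site-specific, so inspect the real login form before copying the parameter names.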

xmltask error encoding

I'm trying to get the value of a node in an XML file, but the value I get is wrong. I believe the problem is the encoding. Can someone help me? My code is below:
In xml File:
<?xml version="1.0" encoding="UTF-8" ?>
<projects>
<project>
<application>Padrão</application> <!-- The problem is the character ~ -->
<name>padrao</name>
<icon>c:\buffer</icon>
<market>br.com.tls.test</market>
</project>
</projects>
My ant code
<xmltask source="config.xml" encoding="UTF-8">
<call path="//project">
<param name="name" path="name/text()" />
<param name="market" path="market/text()" />
<param name="icon" path="icon/text()" />
<param name="application" path="application/text()" />
<actions>
<echo message="@{application}" />
<init-release name="@{name}" market="@{market}" icon="@{icon}" application="@{application}"/>
</actions>
</call>
</xmltask>
Result
[echo]: padr#o
expected
[echo]: padrão
Solution
I changed the file to UTF-8 and the replace then succeeded.
I haven't used xmltask, but the echo task also has an encoding attribute; have you tried setting that as well?
e.g. <echo message="@{application}" encoding="UTF-8" />
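For anyone trying to work out which side is mangling the character, here is a small Python sketch of the two usual failure modes. It is an illustration only, not Ant code, and I can't tell from the question which of the two actually happened.

```python
TEXT = "Padrão"

# The bytes as they sit in a file that really is saved as UTF-8.
UTF8_BYTES = TEXT.encode("utf-8")


def read_with_wrong_charset(data):
    """Decode UTF-8 bytes as ISO-8859-1: each byte becomes its own character."""
    return data.decode("iso-8859-1")


def print_to_ascii_console(text):
    """Mimic an output stream that cannot represent non-ASCII characters."""
    return text.encode("ascii", errors="replace").decode("ascii")


if __name__ == "__main__":
    print(len(TEXT), len(UTF8_BYTES))           # 6 characters, 7 bytes
    print(read_with_wrong_charset(UTF8_BYTES))  # the a-tilde turns into two characters
    print(print_to_ascii_console(TEXT))         # the a-tilde turns into one placeholder
```

If the broken output shows two odd characters in place of the accented one, the file is being read with the wrong charset. If it shows a single placeholder, the text was probably read correctly and the console or log encoding is the culprit.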
