What is the structure of Wikipedia dumps? - xml-parsing

I need a list of Hungarian words for a project, and the only usable source I have found is the Wikipedia XML dumps. They are really big; I guess I could parse them with a read stream and a SAX parser, but it would be nice to know more about the structure so I could test the code on a small example before running it on the big files. Is there a description somewhere of what structure they use and what the different XML gzip files contain? https://dumps.wikimedia.org/enwiki/latest/ https://dumps.wikimedia.org/huwiki/latest/

The format is documented at https://www.mediawiki.org/wiki/Help:Export and looks like this:
<mediawiki xml:lang="en">
<page>
<title>Page title</title>
<restrictions>edit=sysop:move=sysop</restrictions>
<revision>
<timestamp>2001-01-15T13:15:00Z</timestamp>
<contributor><username>Foobar</username></contributor>
<comment>I have just one thing to say!</comment>
<text>A bunch of [[Special:MyLanguage/text|text]] here.</text>
<minor />
</revision>
<revision>
<timestamp>2001-01-15T13:10:27Z</timestamp>
<contributor><ip>10.0.0.2</ip></contributor>
<comment>new!</comment>
<text>An earlier [[Special:MyLanguage/revision|revision]].</text>
</revision>
</page>
<page>
<title>Talk:Page title</title>
<revision>
<timestamp>2001-01-15T14:03:00Z</timestamp>
<contributor><ip>10.0.0.2</ip></contributor>
<comment>hey</comment>
<text>WHYD YOU LOCK PAGE??!!! i was editing that jerk</text>
</revision>
</page>
</mediawiki>
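If you only need the article text, a streaming SAX parse over the dump is enough. Below is a minimal sketch in Java (any SAX parser in any language works the same way); it assumes a gzip-compressed dump as in the links above and simply prints the raw wikitext of every <text> element, which you would still have to split into words and strip of wiki markup. Note that the big pages-articles dumps are usually .bz2, so you may need a different decompressor.
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class DumpText {
    public static void main(String[] args) throws Exception {
        // Stream the compressed dump instead of loading it into memory.
        try (InputStream in = new GZIPInputStream(Files.newInputStream(Paths.get(args[0])))) {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(in, new DefaultHandler() {
                private boolean inText;

                @Override
                public void startElement(String uri, String local, String qName, Attributes atts) {
                    inText = "text".equals(qName);
                }

                @Override
                public void endElement(String uri, String local, String qName) {
                    inText = false;
                }

                @Override
                public void characters(char[] ch, int start, int length) {
                    if (inText) {
                        System.out.print(new String(ch, start, length)); // raw wikitext
                    }
                }
            });
        }
    }
}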

Related

Getting errors in Saxon-HE 9.9.1 when processing DITA: I/O error on DTD

Using Saxon 9.9.1.3J, I am getting an I/O error every time I try to transform a DITA file that has a DTD:
I/O error reported by XML parser processing file:/test.dita: /learningAssessment.dtd (No such file or directory)
This happens even if I force -dtd:off on the command line. Commenting out the DTD in the DITA file does allow it to process.
Interestingly, when I run the same DITA file in oXygen using Saxon-HE 9.8.0.12, it does process correctly. Any idea what might be causing this to behave differently?
Sample DITA file:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE learningAssessment PUBLIC "-//OASIS//DTD DITA Learning Assessment//EN" "learningAssessment.dtd">
<learningAssessment id="id">
<title>Title</title>
<learningAssessmentbody>
<lcInteraction>
<lcSingleSelect id="lcSingleSelect_agy_fxz_ljb">
<lcQuestion>Question</lcQuestion>
<lcAnswerOptionGroup id="lcAnswerOptionGroup_bgy_fxz_ljb">
<lcAnswerOption>
<lcAnswerContent>A</lcAnswerContent>
</lcAnswerOption>
<lcAnswerOption>
<lcAnswerContent>B</lcAnswerContent>
<lcCorrectResponse/>
</lcAnswerOption>
</lcAnswerOptionGroup>
</lcSingleSelect>
</lcInteraction>
</learningAssessmentbody>
</learningAssessment>
And here's a shell of an XSL that demonstrates the error:
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="3.0">
<xsl:output />
<xsl:template match="/">
<xsl:apply-templates/>
</xsl:template>
</xsl:stylesheet>
You can resolve the problem by the following steps:
Download DITA-OT and expand it to any folder you like. In my case it is located at D:\DITA-OT\dita-ot-3.3.4.
Set the CLASSPATH environment variable to contain saxon9he.jar and xml-resolver-1.2.jar from DITA-OT/lib.
Invoke Saxon by specifying the class name net.sf.saxon.Transform and the catalog: parameter pointing to [DITA-OT]/catalog-dita.xml.
Here is an example execution from a command window; the paths are illustrative and follow the steps above:
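set CLASSPATH=D:\DITA-OT\dita-ot-3.3.4\lib\saxon9he.jar;D:\DITA-OT\dita-ot-3.3.4\lib\xml-resolver-1.2.jar
java net.sf.saxon.Transform -s:test.dita -xsl:test.xsl -o:test.html -catalog:D:\DITA-OT\dita-ot-3.3.4\catalog-dita.xml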
Hope this helps!
My guess is that you have somehow contrived to give the document a base URI of "file:/test.dita: ", including the final space. You haven't shown how you are running the transformation, so we can't tell where this base URI comes from.
The option -dtd:off is a little misleading. It doesn't switch off DTD processing, only DTD-based validation, which is just one aspect of DTD processing. An XSLT processor always needs to ask the XML parser to read the DTD in order to expand any entity references.
(Well, theoretically it could delay reading any external DTD until it finds the first entity reference; but sadly, I don't know of any XML parser that does that.)
I misunderstood how DTDs work. I assumed the public ones were loaded from an HTTP URL, but they need to be local files. Loading the catalog for DITA OT resolved the issue.
transform -s:test.dita -xsl:test.xsl -o:test.html -catalog:/org.oasis-open.dita.v1_2/plugins/org.oasis-open.dita.v1_2/catalog.xml
Where the catalog option points to this file on my local filesystem, which comes from DITA-OT.

Convert Nokogiri XML Document into Array of Strings?

I'm creating a Ruby on Rails application and using Nokogiri to parse an XML file. I'm trying to parse the XML file into mutable strings which I can manipulate to create other content.
Here's a sample XML I'm using
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
</feed>
This is what I've done so far relating to my problem
In my controller -
def index
  @blog_title, @blog_post = parse_xml
end

private

def parse_xml
  @xml_doc = Nokogiri::XML(open("atom.xml"))
  titles = @xml_doc.css("entry title")
  post = @xml_doc.css("content")
  return titles, post
end
In my view -
<% for i in 1..@blog_title.length %>
  <li><%= @blog_title[i-1] %></li>
  <li><%= @blog_post[i-1] %></li>
<% end %>
A sample output from the view (it returns a Nokogiri Element) -
<title type="html"><![CDATA[First Post!]]></title>
So ideally, I'd like to make all the Nokogiri::Element inside the Nokogiri::Document a string or make the entire array a String array.
I've tried iterating through each element and calling .to_s but it doesn't seem to work.
I've also tried calling Ruby::String methods such as slice and that doesn't work (for obvious reasons).
The end result I'm trying to get at (using the sample output on my view) is to return only the following and none of the rest.
First Post!
Can anyone help me? If I'm not clear enough or if someone needs to see more work, please feel free to ask!
For your case you should simply use .text to extract the content of tags. Something like titles.text would work.
You're dealing with RSS/Atom feeds which can contain multiple title tags. You need to iterate over all title nodes and extract their content separately, in a way that lets you keep track of their order and what article they're attached to:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
</feed>
EOT
doc.search('title').map(&:text)
# => ["\n First Post! \n "]
This returns an array of the text inside the title nodes. From there you can easily clean up each string, manipulate them, reuse them, whatever.
doc.search('title').map{ |s| s.text.strip }
# => ["First Post!"]
search returns a NodeSet, which is akin to an array of title nodes found in the document. If you don't iterate over them you'll get a concatenated string containing all their text, which is usually NOT what you want:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<foo>
<title>this</title>
<title>is</title>
<title>what</title>
<title>you'd</title>
<title>get</title>
</foo>
EOT
doc.search('title').text
# => "thisiswhatyou'dget"
versus:
doc.search('title').map(&:text)
# => ["this", "is", "what", "you'd", "get"]
Trying to tear apart the first result is impossible unless you have prior knowledge of the document's structure which is usually not true. Iterating over the returned NodeSet will yield very usable results.
To maintain consistency with the various title tags in a feed, you need to loop over the entries, then extract the embedded titles which is a bit different than what your sample XML and code shows:
require 'nokogiri'
doc = Nokogiri::XML(<<EOT)
<feed xmlns="http://www.w3.org/2005/Atom">
<entry>
<title type="html">
<![CDATA[ First Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>I’m very excited to have finally got my site up and running along with this blog!</p>]]>
</content>
</entry>
<entry>
<title type="html">
<![CDATA[ Second Post! ]]>
</title>
<content type="html">
<![CDATA[
<p>blah</p>]]>
</content>
</entry>
</feed>
EOT
titles = doc.search('entry').map { |entry|
entry.at('title').text.strip
}
titles # => ["First Post!", "Second Post!"]
Or perhaps more usable:
titles_and_content = doc.search('entry').map { |entry|
[
entry.at('title').text.strip,
entry.at('content').text.strip
]
}
titles_and_content
# => [["First Post!",
# "<p>I’m very excited to have finally got my site up and running along with this blog!</p>"],
# ["Second Post!", "<p>blah</p>"]]
which returns the title and the content for each entry. From this you can easily build up code to extract the links to the articles, date of publishing, refresh-rates, original site, everything you'd want to know about an individual article and its source, then store it in a database for later regurgitation when requested.
There are gems and scripts available for processing RDF, RSS and Atom feeds; however, years ago, when I had to write a huge aggregator for feeds, nothing available met my needs, so I wrote one from scratch. I'd recommend trying to find one rather than reinventing that wheel; otherwise, look through their source and learn from their experience. There are a number of things to do in code to be a good network citizen that doesn't swamp the servers and get you banned.
See "How to avoid joining all text from Nodes when scraping" also.

How to extend felogin's locallang?

How is it possible to add translations of strings to the felogin plugin? I am slowly starting to get the convention for templates (pointing to the modified templates in the plugin's TypoScript configuration), but that does not work for the locallang. The original messages are in English in the xlf format, located in the plugin's folder. I know this can be done in TypoScript, but I do not like having the strings defined so inconsistently. (I guess modifying that original file is not the proper way.)
Overriding labels by TypoScript is the way to go. Manually editing the l10n files is a really bad idea: these files are overwritten when the translations are updated, and if the extension gets an update with new labels added, you will want to receive those updates.
The change from XML files to the XLIFF format for translations didn't change anything about the best practice for adjusting labels to your needs. It's just another format, with a standardized translation server (Pootle) that (in theory) allows some special features such as plural forms.
Conclusion: Use TypoScript.
For the default language (no config.language set) use:
plugin.tx_felogin_pi1._LOCAL_LANG.default {
key = value
}
For a specific language, e.g. German, use
plugin.tx_felogin_pi1._LOCAL_LANG.de {
key = value
}
The best way is to use the following: in ext_tables.php, e.g. of your theme extension, add
$GLOBALS['TYPO3_CONF_VARS']['SYS']['locallangXMLOverride']['EXT:felogin/Resources/Private/Language/locallang.xlf'][] = 'EXT:theme/Resources/Private/Language/locallang_felogin.xlf';
and in this file you can use something like
<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<xliff version="1.0">
<file source-language="en" datatype="plaintext" original="messages" date="2015-06-30T21:14:27Z">
<header>
<description>Language labels for felogin</description>
</header>
<body>
<trans-unit id="permalogin">
<source>An diesem Rechner angemeldet bleiben</source>
</trans-unit>
<trans-unit id="ll_forgot_header">
<source>Passwort vergessen?</source>
</trans-unit>
<trans-unit id="ll_welcome_header">
<source>Sie betreten den Händlerbereich</source>
</trans-unit>
<trans-unit id="ll_welcome_message">
<source>Bitte loggen Sie sich ein.</source>
</trans-unit>
</body>
</file>
</xliff>
You can use the BE language module to download preconfigured locallang files for every language you need.
The files are stored in e.g. typo3conf/l10n/de/fe_login.
You can edit these files manually in l10n to get your own strings in, or use an extension like snowbabel to do the editing inside a BE module.

How to get language-specific messages using Google Closure Templates

I am trying to add internationalization support to my project, and for this people suggested Google Closure Templates, but I am very new to them. I am trying to get the language-specific messages using a Closure Template, but I am not getting them in the xlf file. If anyone knows how to generate language-specific messages using Closure Templates, please tell me the steps; that would be a great help to me.
My .soy file code is below.
{namespace poc}
/**
 * Testing message translation
 * @param pageTitle
 */
{template .translate}
<html>
<head>
<title>{$pageTitle}
</title>
</head>
<div>
{msg desc="Hello"}Hello{/msg}
</div>
</html>
{/template}
and the generated .xlf content is below
<?xml version="1.0" encoding="UTF-8"?>
<xliff version="1.2" xmlns="urn:oasis:names:tc:xliff:document:1.2">
<file original="SoyMsgBundle" datatype="x-soy-msg-bundle" xml:space="preserve" source-language="en" target-language="pt-BR">
<body>
<trans-unit id="2286494898080570401" datatype="html">
<source>Thanks</source>
<target/>
<note priority="1" from="description">Says thanks</note>
</trans-unit>
</body>
</file>
</xliff>
I see you already used the SoyMsgExtractor to create the base xlf. Next you need to make translations of this base xlf for the languages you want to support; a file is created for each language. I used the xliff editor from Translution: http://sourceforge.net/projects/eviltrans.
Next, using the SoyToJsSrcCompiler, a translated soy can be made per language:
java -jar SoyToJsSrcCompiler.jar --shouldGenerateGoogMsgDefs --bidiGlobalDir 1 --messageFilePathFormat Filename_en-us.xliff --outputPathFormat FileName_fr.js *.soy
This will create a FileName_fr.js file that contains the compiled soy file.
Including this file instead of the original (compiled) soy will give you a localized version.
Good luck!
Rene
I think the easiest way is to make (i.e. generate from whatever source) a separate js file which contains one messages object, and reference it through an extern-declared function.
It just works and has no complicated dependencies.

Include static HTML (invalid XHTML) file to JSF Facelets

I have the following problem: we have a web content manager (WCM) running at a remote host,
which is responsible for generating header and footer HTML files,
i.e. header.html and footer.html.
The HTML files are not properly formatted syntax-wise.
The WCM-generated files contain:
Non-breaking space entities (&nbsp;), which are not allowed in XHTML.
Non-closing line break tags (<br>), which are invalid in XHTML.
So the WCM-generated HTML pages might not be valid XHTML pages.
We are implementing some of our applications in JSF,
where we need to include the WCM-generated header and footer files.
Can we include the non-well-formed HTML files in our XHTML files?
commonTemplate.xhtml
<html>
<head>
..........;
</head>
<body>
<ui:include src="remote_host/header.html" />
<ui:insert name="commonBodyContent" />
<ui:include src="remote_host/footer.html" />
</body>
</html>
I guess it is related to this question: Include non-Facelet content in a Facelet template
I do not recommend mixing XHTML with HTML, but most probably the browsers will not have any issues with the mentioned characters, hence you might try to render the file content directly, e.g. by
<h:outputText value="#{yourBean.headerCode}" escape="false" />
where YourBean.getHeaderCode() would read the header file's content and return it as a String. YourBean should be ApplicationScoped.
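A minimal sketch of such a bean, assuming the header ends up as a plain file the application can read (the bean name, file path, and annotations are just examples; if the file lives on the remote host you would fetch it over HTTP instead):
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.faces.bean.ApplicationScoped;
import javax.faces.bean.ManagedBean;

@ManagedBean
@ApplicationScoped
public class YourBean {

    private String headerCode;

    // Reads the WCM-generated header once and returns it verbatim;
    // escape="false" on the outputText then renders it as-is.
    public String getHeaderCode() {
        if (headerCode == null) {
            try {
                headerCode = new String(
                        Files.readAllBytes(Paths.get("/path/to/header.html")),
                        StandardCharsets.UTF_8);
            } catch (IOException e) {
                headerCode = ""; // fall back to an empty header if the file cannot be read
            }
        }
        return headerCode;
    }
}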
Faster and better would be to get the WCM to generate valid XHTML.
