Convert from wiki to html - ruby-on-rails

I'm using a Wikipedia API to get info from Wikipedia.
Is there anything for converting wiki text to HTML?
I've tried MediaCloth, but it doesn't work well.

Take a look at marker.
>> require 'marker'
>> m = Marker.parse "== heading ==\nparagraph with '''bold''' text"
>> puts m.to_html
<h2>heading</h2>
<p>paragraph with <b>bold</b> text</p>

Also try wikicloth (http://code.google.com/p/wikicloth/); it implements some things that others don't, like tables.
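For reference, here's a minimal wikicloth sketch based on the gem's README (verify against the current API before relying on it):
require 'wikicloth'

wiki = WikiCloth::Parser.new(:data => "== heading ==\nparagraph with '''bold''' text")
puts wiki.to_html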

You could download a static HTML dump of Wikipedia.

Related

Carrying style IDs/names from HTML to .docx?

Is it possible to somehow tell pandoc to carry the names of styles from the original HTML over to .docx?
I understand that in order to tune the actual styles, I should be using the reference.docx file generated by pandoc. However, reference.docx is limited to the styles it already has: headings, body text, block text, etc.
I'd like to:
specify "myStyle" style in the input HTML (via a "class" attribute, via any other HTML attribute or even via a filter code written in Lua),
<html>
<body>
<p>Hello</p>
<p class="myStyle">World!</p>
</body>
</html>
add a custom "myStyle" to reference.docx using Word,
run an HTML->docx conversion and expect pandoc to generate a paragraph element with "myStyle" (instead of BodyText, which I believe it sets by default), so the end result looks like this (the contents of word/document.xml inside the resulting output.docx, cut for brevity):
<w:p>
<w:pPr>
<w:pStyle w:val="BodyText" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">Hello</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="myStyle" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">World!</w:t>
</w:r>
</w:p>
There's some evidence styleId can be passed around, but I don't really understand it and am unable to find any documentation about it.
The doc on filtering in Lua states you can access attrs when manipulating a pandoc.Div, but it says nothing about whether any of the attrs will be interpreted by pandoc in any meaningful way.
Finally, I found what I needed: custom styles. It's limited, but better than what I had arrived at earlier, and of course much better than nothing at all :)
I'll leave a step-by-step guide here in case anyone stumbles upon a similar question.
First, generate a reference.docx file like this:
pandoc --print-default-data-file reference.docx > styles.docx
Then open the file in MS Word (I was using the macOS version) and you'll see the available styles.
Click the "New style..." button on the right, and create a style to your liking. In my case I changed the text style to bold, in blue.
Since I am converting from HTML to DOCX, here's my input.html:
<html>
<body>
<div>Page 1</div>
<div custom-style="eugene-is-testing">Page 2</div>
<div>Page 3</div>
</body>
</html>
Run:
pandoc --standalone --reference-doc styles.docx --output output.docx input.html
Finally, enjoy the result.
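As a side note, the same custom-style attribute should also work on an inline span, mapping it to a character style rather than a paragraph style (per pandoc's custom styles documentation; the style name below is made up):
<p>Hello <span custom-style="myCharStyle">World!</span></p>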

How do I escape HTML for th:errors?

Let's say I have:
<span th:if="${#fields.hasErrors('firstName')}" class="color--error" th:errors="*{firstName}"></span>
How do I keep the text from being escaped if the error text contains HTML? I know that for normal text we can use th:utext.
As of 3.0.8-SNAPSHOT, Thymeleaf-Spring has th:uerrors.
See this GitHub issue for the discussion: https://github.com/thymeleaf/thymeleaf-spring/issues/153
And this change log for 3.0.8: http://forum.thymeleaf.org/Thymeleaf-3-0-8-JUST-PUBLISHED-td4030687.html
th:errors is just a shortcut. You still use th:utext for this; you just have to output your errors manually. In your case, the code could look something like:
<div th:if="${#fields.hasErrors('firstName')}" th:each="err: ${#fields.errors('firstName')}" th:utext="${err}" class="color--error" />

Nokogiri results different from brower inspect

I am trying to scrape a site, but the links returned are different from what I see when I inspect the page in the browser.
In my browser I get normal links, but from Nokogiri all the a href values come back as javascript:void(0);.
Here is the site:
https://www.ctgoodjobs.hk/jobs/part-time
Here is my code:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
response = open(url) rescue nil
next unless response
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').text
It's not that easy: the URLs are "obscured" using a JS function, which is why you're getting javascript:void(0) when asking for the hrefs. Looking at the HTML, there are some hidden inputs for each link, and there is a preview URL that you can use to build the job preview URL (if that's what you're looking for), so you have this:
<div class="result-list-job current-view">
<input type="hidden" name="job_id" value="04375145">
<input type="hidden" name="each_job_title_url" value="barista-senior-barista-咖啡調配員">
<h2 class="job-title">Barista/ Senior Barista 咖 啡 調 配 員</h2>
<h3 class="job-company">PACIFIC COFFEE CO. LTD.</h3>
<div class="job-description">
<ul class="job-desc-list clearfix">
<li class="job-desc-loc job-desc-small-icon">-</li>
<li class="job-desc-work-exp">0-1 yr(s)</li>
<li class="job-desc-salary job-desc-small-icon">-</li>
<li class="job-desc-post-date">09/11/16</li>
</ul>
</div>
<a class="job-save-btn" title="save this job" style="display: inline;"> </a>
<div class="job-batch-apply"><span class="checkbox" style="background-position: 0px 0px;"></span><input type="checkbox" class="styled" name="job_checkbox" value="04375145"></div>
<div class="job-cat job-cat-de"></div>
</div>
then, you can retrieve each job_id from those inputs, like:
inputs = doc.search('//input[@name="job_id"]')
and then build the URLs (I found the base URL in joblist_preview.js):
urls = inputs.map do |input|
"https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{input['value']}&joblistmode=previewlist&ga_channel=ct"
end
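Putting those two pieces together, a complete sketch might look like this (the preview-URL format comes from joblist_preview.js as noted above, so it may change whenever the site does):
require 'nokogiri'
require 'open-uri'

doc = Nokogiri::HTML(open('https://www.ctgoodjobs.hk/jobs/part-time'))

# pair each job title with the URL built from its hidden job_id input
jobs = doc.search('.result-list-job').map do |job|
  id    = job.at('input[name="job_id"]')['value']
  title = job.at('.job-title').text.strip
  url   = "https://www.ctgoodjobs.hk/english/jobdetails/details.asp?m_jobid=#{id}&joblistmode=previewlist&ga_channel=ct"
  [title, url]
end

jobs.each { |title, url| puts "#{title} => #{url}" }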
Take the output of a browser and that of a tool like wget, curl or Nokogiri, and you will find the HTML the browser presents can differ drastically from the raw HTML.
Browsers these days process DHTML; Nokogiri doesn't. You can only retrieve the raw HTML using something that lets you see the content without the browser, like the tools mentioned above, then compare that with what you see in a text editor or what Nokogiri shows you. Don't trust the browser - they're known to lie because they want to make you happy.
Here's a quick glimpse into what the raw HTML contains, generated using:
$ nokogiri "https://www.ctgoodjobs.hk/jobs/part-time"
Nokogiri dropped me into IRB:
Your document is stored in @doc...
Welcome to NOKOGIRI. You are using ruby 2.3.1p112 (2016-04-26 revision 54768) [x86_64-darwin15]. Have fun ;)
Counting the hits found by the selector returns:
>> @doc.search('.job-title > a').size
30
Displaying the text found shows:
>> @doc.search('.job-title > a').map(&:text)
[
[ 0] "嬰 兒 奶 粉 沖 調 機 - 兼 職 產 品 推 廣 員 Part Time Promoter (時 薪 高 達 HK$90, 另 設 銷 售 佣 金 )",
...
[29] "Customer Services Representative (Part-time)"
]
Looking at the actual href:
>> @doc.search('.job-title > a').map{ |n| n['href'] }
[
[ 0] "javascript:void(0);",
...
[29] "javascript:void(0);"
]
You can tell the HTML doesn't contain anything but what Nokogiri is telling you; the browser is post-processing the HTML, running the DHTML and modifying the page before you see it. So the short answer is: don't trust the browser if you want to know what the server actually sends you.
This is why scraping isn't very reliable and you should use an API if at all possible. If you can't, then you're going to have to roll up your sleeves and dig into the JavaScript and manually interpret what it's doing, then retrieve the data and parse it into something useful.
Your code can be cleaned up and simplified. I'd write it much more simply as:
url = "https://www.ctgoodjobs.hk/jobs/part-time"
doc = Nokogiri::HTML(open(url))
links = doc.search('.job-title > a').map(&:text)
The use of search(...).text is a big mistake. text, when applied to a NodeSet, will concatenate the text of each contained node, making it extremely difficult to retrieve the individual text. Consider this:
require 'nokogiri'
doc = Nokogiri::HTML(<<EOT)
<html>
<body>
<p>foo</p>
<p>bar</p>
</body>
</html>
EOT
doc.search('p').class # => Nokogiri::XML::NodeSet
doc.search('p').text # => "foobar"
doc.search('p').map(&:text) # => ["foo", "bar"]
The first result, foobar, would have to be split apart to be useful, and unless you have special knowledge of the content, figuring out how to do that will be a major pain.
Instead, use map to iterate through the elements and apply &:text to each one, returning an array of each element's text.
See "How to avoid joining all text from Nodes when scraping" and "Taking apart a DHTML page" also.

Google Translate messes up the encoding of my file

I am trying to use Google Translate for localization of an XML file. It has nearly 350K lines, but some of them contain markup for in-game font size and color, like so:
<replacement><p horizontalalignment="center"><br/><image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/><br/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Six_Superior" scalerate="1.5"/><image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/><br/><image enablescale="true" imagesetpath="00009499.Field_Boss" scalerate="1.4"/>Хмельной лик<br/><br/></p>Уничтожить зараженных насекомых<br/>возле мест обитания их королевы。<br/></replacement>
Now, for God knows what reason, Google Translate alters that markup in the process of translation into something unacceptable, like so:
<replacement> <p horizontalalignment="center"> <br/> <image enablescale="false" imagesetpath="00015590.InterD_Jeryoung_3"/> <br/> <image enablescale = "true "imagesetpath =" 00015590.Tag_Dungeon_Six_Superior "scalerate =" 1.5 "/> <image enablescale="true" imagesetpath="00015590.Tag_Dungeon_Four_Superior" scalerate="1.5"/> <br/> <image enablescale = "true" imagesetpath = "00009499.Field_Boss" scalerate = "1.4" /> Intoxicated face <br/> <br/> </ p> Destroy infected insects <br/> habitats near their queen. <br/> </ replacement>
Is there any way to avoid that, and why exactly is it happening? Any help on the matter is appreciated, thanks.
EDIT: I am also looking for a way to input my text and have it come out in the same exact language, with only the markup mishaps changing, so I can isolate those, build a comparison table, and then use it to fix the errors after the actual translation is done. But I don't see a way to select the same language as both input AND output in Google Translate; it always forces me to choose a different one for input or output. That kind of makes sense, but if there is a way to do it, I might be able to work around the problem.
Do not feed Google Translate your XML file; as far as I know, it doesn't understand XML.
Extract the text from the Xml file.
Feed the text to translate.
Transform the text back to Xml.
You could simply transform the XML into a text document with a single line per XML element, so it would be easier to turn back into XML.
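As an illustration, here's a minimal Nokogiri sketch of that round trip; it assumes one text node per line and that the translated file comes back with the lines in the same order (the file names are made up):
require 'nokogiri'

# 1. extract: write every non-blank text node to its own line
doc = Nokogiri::XML(File.read('strings.xml'))
texts = doc.xpath('//text()').reject { |t| t.text.strip.empty? }
File.write('to_translate.txt', texts.map { |t| t.text.strip }.join("\n"))

# ... translate to_translate.txt externally ...

# 2. re-inject: replace each text node, by position, with its translated line
translated = File.readlines('translated.txt', chomp: true)
texts.each_with_index { |node, i| node.content = translated[i] }
File.write('strings_translated.xml', doc.to_xml)

Note that inline tags like <br/> split the surrounding text into separate nodes, so this only works if the translation preserves the line count.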
More detail
According to the Toolkit you can upload:
HTML (.HTML)
Microsoft Word (.DOC/.DOCX)
OpenDocument Text (.ODT)
Plain Text (.TXT)
Rich Text (.RTF)
Wikipedia URLs
And a couple of extras such as JSON. So no XML.
The best way I see is to transform your XML document into one of these types (I would probably use JSON), and to transform it in such a way that it can easily be transformed back again: either by position (line 1 in the text file is the first element in the XML document) or by an id (add the id or position of the element in the XML hierarchy to the JSON element).
My guess is that the toolkit recognizes the HTML tags in the XML and escapes them. So another option might be to un-escape the &gt; back to > and the &lt; back to <.

MODX: Snippet strips and hangs string when parsing the vars

I have a snippet call like this:
[!mysnippet?&content=`[*content*]` !]
What happens is that if I send some HTML like this:
[!mysnippet?&content=`<p color='red'>Yeah</p>` !]
it will return this:
<p colo
The (test-only) snippet code (mysnippet) is:
<?php
return $content;
?>
Why is this happening?
My actual snippet converts HTML to PDF, so I really need this to work.
Thank you all ;D
EDIT: I'm using Modx Evo 1.0.2
MODx Evolution has a limitation whereby you can't use "=" (equals signs) in Snippet parameter values. Best solution is to place the content in a chunk or TV and then call it. This is not an issue in MODx Revolution.
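For example (the chunk name here is made up): put the HTML into a chunk called pdfContent, then pass the chunk tag instead of the literal markup:
[!mysnippet?&content=`{{pdfContent}}` !]
If the parser still chokes on the chunk's contents, a more robust variant is to pass only the chunk's name as the parameter and fetch the HTML inside the snippet with $modx->getChunk('pdfContent').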
