I'm running into a strange problem with Nokogiri and XPath. I want to parse an HTML document and get all links by their href value and the anchor text they contain.
Here's my xpath so far:
xpath = "//a[contains(text(), #{link['anchor_text']}) and @href='#{link['target_url']}']"
a = doc.search(xpath)
This works fine so far as long as link['anchor_text'] is a string without numbers.
If I'm trying to get a link with the anchor text "11example" it throws the following error:
Invalid expression: //a[contains(text(), 11example) and @href='http://www.example.com/']
Maybe it's just a stupid mistake, but I'm not seeing why this error occurs. If I put some quotes around the #{link['anchor_text']} in the xpath, nothing is working.
Edit: Here's the sample HTML:
<!DOCTYPE html>
<head>
<title>Example.com</title>
</head>
<body>
<p>
<strong>Here is some text</strong><br />
11exampleSome text here and there
</p>
<p>
<strong>Another text</strong><br />
example.comSome text here and there
</p>
</body>
Edit2: If I run these queries manually in irb console everything works as expected, but only if I put the text in quotes.
Thanks in advance!
Kind regards,
madhippie
The simple answer is that you are missing quotes around #{link['anchor_text']}, like you have around #{link['target_url']}. The full XPath should be
xpath = "//a[contains(text(), '#{link['anchor_text']}') and @href='#{link['target_url']}']"
The reason it appears to work (or at least does not produce an error) when the text doesn't start with a number is that the unquoted string is interpreted as a node query. For example, with example.com Nokogiri looks for a child element named <example.com> inside the <a> tag, converts it to a string, and checks whether the text nodes of the <a> tag contain that string. If the element isn't there (as in this case), it converts to an empty string, and contains() with an empty string is always true.
As a demonstration, with the HTML:
<a><q>foo</q>example</a>
<a><q>foo</q>foo</a>
<a>foo</a>
Then the query
doc.search("//a[contains(text(), q)]")
doesn’t match the first <a> tag, but does match the second and third.
When the string starts with a number, it can’t be parsed into a node query since names starting with digits aren’t valid XML (or HTML) element names, so you get an error.
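Quoting by hand also breaks as soon as the anchor text itself contains a quote character. Here is a minimal sketch of a quoting helper, written in Python only so that it is easy to test; the names xpath_literal and build_link_xpath are hypothetical and not part of Nokogiri or any other library, and the same logic should port to Ruby line by line:

```python
def xpath_literal(text):
    """Return text as an XPath 1.0 string literal (hypothetical helper)."""
    if "'" not in text:
        return "'" + text + "'"
    if '"' not in text:
        return '"' + text + '"'
    # XPath 1.0 has no escape character, so a string containing both kinds
    # of quote has to be assembled with concat().
    pieces = []
    for i, part in enumerate(text.split("'")):
        if i > 0:
            pieces.append('"\'"')  # a single quote wrapped in double quotes
        if part:
            pieces.append("'" + part + "'")
    return "concat(" + ", ".join(pieces) + ")"


def build_link_xpath(anchor_text, target_url):
    """Build the query from the question with both values safely quoted."""
    return "//a[contains(text(), {}) and @href={}]".format(
        xpath_literal(anchor_text), xpath_literal(target_url))


print(build_link_xpath("11example", "http://www.example.com/"))
```

With the values from the question this produces the quoted form of the query, so "11example" is treated as a string rather than as an element name.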
Is it possible to somehow tell pandoc to carry the names of styles from original HTML to .docx?
I understand that in order to tune the actual styles, I should be using the reference.docx file generated by pandoc. However, reference.docx is limited to the styles it already contains: headings, body text, block text, etc.
I'd like to:
specify "myStyle" style in the input HTML (via a "class" attribute, via any other HTML attribute or even via a filter code written in Lua),
<html>
<body>
<p>Hello</p>
<p class="myStyle">World!</p>
</body>
</html>
add a custom "myStyle" to reference.docx using Word,
run an HTML-to-docx conversion and expect pandoc to generate a paragraph element with "myStyle" (instead of BodyText, which I believe it sets by default), so that the end result looks like this (contents of word/document.xml inside the resulting output.docx, cut for brevity):
<w:p>
<w:pPr>
<w:pStyle w:val="BodyText" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">Hello</w:t>
</w:r>
</w:p>
<w:p>
<w:pPr>
<w:pStyle w:val="myStyle" />
</w:pPr>
<w:r>
<w:t xml:space="preserve">World!</w:t>
</w:r>
</w:p>
There's some evidence styleId can be passed around, but I don't really understand it and am unable to find any documentation about it.
Doc on filtering in Lua states you can access attrs when manipulating a pandoc.div, but it says nothing about whether any of the attrs will be interpreted by pandoc in any meaningful way.
Finally, I found what I needed: custom styles. It's limited, but better than what I had arrived at earlier, and of course much better than nothing at all :)
I'll leave a step-by-step guide here in case anyone stumbles upon a similar question.
First, generate a reference.docx file like this:
pandoc --print-default-data-file reference.docx > styles.docx
Then open the file in MS Word (I was using the macOS version), where you'll see this:
Click the "New style..." button on the right and create a style to your liking. In my case I made the text bold and blue:
Since I am converting from HTML to DOCX, here's my input.html:
<html>
<body>
<div>Page 1</div>
<div custom-style="eugene-is-testing">Page 2</div>
<div>Page 3</div>
</body>
</html>
Run:
pandoc --standalone --reference-doc styles.docx --output output.docx input.html
Finally, enjoy the result:
I tried to parse SEC company filings from sec.gov. Starting from the fb 10-Q index.htm, let's look at a complete submission text filing. It has a structure like:
<SEC-DOCUMENT>
<SEC-HEADER>
<ACCEPTANCE-DATETIME>"some content" This tag is not closed.
"some lines resembling yaml markup"
These are indented lines with a
"key": "value" structure.
</SEC-HEADER>
<DOCUMENT>
.
.
some content
.
.
</DOCUMENT>
"several DOCUMENT tags" ...
</SEC-DOCUMENT>
I tried to figure out the structure of the <SEC-HEADER> tag, found some information in the Public Dissemination Service (PDS) Technical Specification (pdf), and concluded that the content of the header should be SGML.
Nevertheless, I am clueless about the formatting, since there are no angle brackets and the key-value pairs are separated by colons, like key: value instead of <key>value</key>. In the linked pdf I could not find anything about colons.
Question: Is the <SEC-HEADER> tag valid SGML? If it is, how to parse it?
I'd be glad of any help.
The short answer is no. The <SEC-HEADER> tag in the raw filing is not valid SGML.
However, it is my understanding that this section in the raw filing is parsed automatically from the header file <accession_num>.hdr.sgml, which does follow SGML. This header file can be found in the same directory as the raw filing (i.e., the <accession_num>.txt file).
I use a REGEX of the form: ^<(.+?)>(.+?)$ (with re.MULTILINE option) to capture each (tag, value) tuple and get the results directly in a dict().
I believe the only tag in that file that has a closing tag is the <FILER> tag, and there can be multiple filers in each filing. You can first extract those using a REGEX of the form: <FILER>(.+?)</FILER> (this one needs the re.DOTALL option, since a filer block spans several lines) and then employ the same REGEX as above to get the inner tags for each filer.
Note that other than 'FILER', there could be other tags, representing different relations of the entities to the filing. Those are 'ISSUER', 'SUBJECT COMPANY', 'FILED BY', 'FILED FOR', 'SERIAL COMPANY', 'REPORTING OWNER'.
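The two regular expressions above can be combined roughly as follows. This is only a sketch: the function name parse_header and the SAMPLE text are made up for illustration (SAMPLE is not taken from a real filing), and repeated tags overwrite each other because the results go into plain dicts.

```python
import re

# One "<TAG>value" pair per line, as described above.
TAG_VALUE = re.compile(r"^<(.+?)>(.+?)$", re.MULTILINE)
# A filer block spans several lines, so "." must also match newlines.
FILER_BLOCK = re.compile(r"<FILER>(.+?)</FILER>", re.DOTALL)


def parse_header(sgml_text):
    """Return (top_level_tags, list_of_filer_dicts) for a .hdr.sgml text."""
    filers = [dict(TAG_VALUE.findall(block))
              for block in FILER_BLOCK.findall(sgml_text)]
    # Remove the filer blocks so their inner tags do not end up at top level.
    top_level = dict(TAG_VALUE.findall(FILER_BLOCK.sub("", sgml_text)))
    return top_level, filers


# Made-up sample with placeholder values, only to show the shape of the input.
SAMPLE = """<TYPE>10-Q
<PERIOD>20990331
<FILER>
<CONFORMED-NAME>Example Corp
<CIK>0000000000
</FILER>
"""

top_level, filers = parse_header(SAMPLE)
print(top_level)
print(filers)
```

The same approach should extend to the other relation tags listed above by building one block pattern per tag name, assuming they are closed the same way as <FILER>.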
Let's say I have:
<span th:if="${#fields.hasErrors('firstName')}" class="color--error" th:errors="*{firstName}"></span>
How do I escape the text if the error text contains HTML? I know for normal text, we can use th:utext.
As of 3.0.8-SNAPSHOT, Thymeleaf-Spring has th:uerrors.
See this GitHub issue for the discussion: https://github.com/thymeleaf/thymeleaf-spring/issues/153
And this change log for 3.0.8: http://forum.thymeleaf.org/Thymeleaf-3-0-8-JUST-PUBLISHED-td4030687.html
th:errors is just a shortcut. You still use th:utext for this, you just have to manually output your errors. In your case, the code could look something like:
<div th:if="${#fields.hasErrors('firstName')}" th:each="err: ${#fields.errors('firstName')}" th:utext="${err}" class="color--error" />
I'm making a reusable package, and in order to get the client side to work both with plain JavaScript and with module loaders, I have a code path that requires me to document.write out script tags.
In my razor view I have something like this:
<script>
...
document.write([
'<script type="text/javascript" src="~/Oaf/SlimHeader/Media/Scripts/jquery-1.9.1.min.js"></script>',
'<script type="text/javascript" src="~/Oaf/SlimHeader/Media/Scripts/jquery-migrate-1.2.1.min.js"></script>',
].join('\n'))
...
</script>
Which Razor refuses to interpret in html mode:
Parser Error Message: Unterminated string literal. Strings that start
with a quotation mark (") must be terminated before the end of the
line. However, strings that start with @ and a quotation mark (@")
can span multiple lines.
indicating the error is in the first script tag. This is javascript, I don't want Razor involved at all! (Ok, it would be nice if it parsed the ~ but honestly I can take care of that myself).
I've tried prefixing every line with @: and surrounding the whole thing in @" ... "@ but neither seems to work.
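This is not a Razor issue. This code is invalid even in a simple HTML file and will cause problems in the browser, because the HTML parser ends the script block at the first literal </script> it sees, even inside a string.
The solution is to split the closing tag inside the string:
var a = '<script><' + '/script>';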
The bug has been closed as by design.
Thanks to Aron who got me to pare this down thereby prompting me to discover the answer.
Pared down the broken code looked like this (I hadn't included the if in the question):
@if (true) {
<script type="text/javascript">
var a = '<script></script>';
</script>
}
Something in the interplay between the @if and the <script> tag in a string just does not sit well. If I force text mode on each line inside the if by prefixing it with @: then it works.
In the original question the solution is to prefix every line inside the Razor block with @:. Surrounding it in a <text> block will not work. If you don't prefix every line with @: then you will get a parsing error, very possibly for a line that was prefixed.
Seems like a bug with Razor. Will report it.
<?xml version="1.0" encoding="utf-8"?>
<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"
android:orientation=”vertical”
android:layout_width=”fill_parent”
android:layout_height=”fill_parent” >
I get these two errors
error: Error parsing XML: not well-formed (invalid token)
&
Open quote is expected for attribute "android:orientation" associated with an element type "LinearLayout".
Did you copy and paste that from Word? Your quotes look a little funky. Sometimes Word will use a different character than the expected " for double quotes. Make sure those are all consistent. Otherwise, the syntax is invalid.
Looks like you have "smart quotes" (not plain " double quotes) around some attributes in your LinearLayout element.
There are many references that explain the differences between valid and well formed XML documents. A good starting point can be found here. There is also an online XML Validator that you can use to test XML documents.
The validator shows that you have two issues:
Some of your attribute values use an invalid quote character: ” vs. ", and
you need to close the LinearLayout tag with /> instead of just >.
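If you want to check this locally instead of using an online validator, a small script can do it. The following is a sketch using Python's standard library; the helper names straighten_quotes and is_well_formed are made up for this example, and BROKEN is a self-closed version of the layout from the question without the XML declaration. Note that the replacement is blind, so it would also change curly quotes that appear in text content.

```python
import xml.etree.ElementTree as ET

# Curly ("smart") quotes mapped to their plain ASCII equivalents.
SMART_QUOTES = {"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"}


def straighten_quotes(text):
    """Replace curly quotes with plain ones (hypothetical helper)."""
    return text.translate(str.maketrans(SMART_QUOTES))


def is_well_formed(xml_text):
    """Return True if the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# The layout from the question, self-closed, with the curly quotes kept.
BROKEN = (
    '<LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"\n'
    '    android:orientation=\u201dvertical\u201d\n'
    '    android:layout_width=\u201dfill_parent\u201d\n'
    '    android:layout_height=\u201dfill_parent\u201d />'
)

print(is_well_formed(BROKEN))
print(is_well_formed(straighten_quotes(BROKEN)))
```

The first check should fail on the curly quotes and the second should pass once they are replaced.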