Why is lxml closing this "ol" tag when parsing? - html-parsing

Here is some HTML:
<ol><ul><li>item</li></ul></ol>
and some Python 3 code that uses lxml to parse it and re-print it:
import sys
from lxml import etree, html
document_root = html.fromstring(sys.stdin.read())
print(etree.tostring(document_root, encoding='unicode'))
Here is the output:
<div><ol/><ul><li>item</li></ul>
</div>
In the output, lxml closes the ol before the ul starts, which changes the list structure.
Why is it doing that?
Can I get lxml to parse HTML in such a way as to preserve the list structure?
EDIT: NOTE that this example parses fine if I replace ul with ol (<ol><ol><li>item</li></ol></ol>), or if I replace ol with ul (<ul><ul><li>item</li></ul></ul>). The output is unchanged from the input.
I don't have control over the HTML, it could come from anywhere.
I'm using lxml 4.6.3, installed from PyPI, and Python 3.9.
OR, is there another way to parse HTML in a way that I can pull list text out of it preserving the list structure in Python?
Just so you know, I'm using lxml to drop attributes, so below is code that is closer to my use case. However, I wanted to give the smallest reproducible test case first.
Code closer to my use case:
import sys
import lxml.html.clean as clean
from lxml import etree, html
document_root = html.fromstring(sys.stdin.read())
cleaner = clean.Cleaner(safe_attrs_only=True, safe_attrs=frozenset())
cleansed = cleaner.clean_html(document_root)
# Do something with the lists in cleansed, defined by ol, ul, and li ..
print(etree.tostring(cleansed, encoding='unicode'))

I think neither HTML 4 nor HTML5 allows a ul element as a direct child of an ol element; only li elements can be direct children.
That might be why an HTML parser builds a tree that does not reflect the nesting in your input markup. Whether a "traditional" HTML 4 parser, which is probably what lxml's/libxml2's HTML parser implements, makes the same change to the structure is something I don't remember, and I am not sure where to test it.
While two HTML5 validators flag your ul as a disallowed child of ol, current browsers seem to preserve that nesting.
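On the question's last point (pulling list text out while keeping the list structure), one option that avoids tree repair altogether is an event-based parser. The following is only a minimal sketch using the standard library's html.parser, which reports tags in the order they appear and does not restructure them. The names ListOutlineParser and list_outline are made up for this illustration, and the sketch assumes the ol/ul tags in the input are balanced.

```python
from html.parser import HTMLParser


class ListOutlineParser(HTMLParser):
    """Collect the text of each <li> with the chain of enclosing ol/ul tags."""

    LIST_TAGS = {"ol", "ul"}

    def __init__(self):
        super().__init__()
        self.open_lists = []  # stack of currently open ol/ul tags
        self.item_depth = 0   # number of currently open <li> elements
        self.items = []       # (enclosing list tags, text) pairs

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored here, which also "drops" them from the result.
        if tag in self.LIST_TAGS:
            self.open_lists.append(tag)
        elif tag == "li":
            self.item_depth += 1

    def handle_endtag(self, tag):
        # Assumption: list tags are balanced, so the top of the stack matches.
        if tag in self.LIST_TAGS and self.open_lists:
            self.open_lists.pop()
        elif tag == "li" and self.item_depth:
            self.item_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.item_depth:
            self.items.append((tuple(self.open_lists), text))


def list_outline(markup):
    """Return (enclosing list tags, text) pairs for every piece of <li> text."""
    parser = ListOutlineParser()
    parser.feed(markup)
    parser.close()
    return parser.items


print(list_outline("<ol><ul><li>item</li></ul></ol>"))
```

For the markup in the question, the item is reported as sitting inside an ol and then a ul, so the original nesting is kept. This does not give you an lxml tree, so it only helps if the list text and its nesting are all you need.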

Related

Does tag-it work on textarea HTML element?

I am using tag-it for my application. My requirement is that I should create tag-it tags and put these tags in a textarea element. I am choosing textarea over input because textarea can support newline.
I have gone through http://aehlke.github.io/tag-it/examples.html, but I notice that it supports only the input element. I have also played around with the code and noticed the same.
Does tag-it work on textarea HTML element?
Looking forward to your response
The tag-it library works on input elements. One would, perhaps, have to modify their library in order to accommodate the textarea element.
An alternative to tag-it is the tagEditor library (https://goodies.pixabay.com/jquery/tag-editor/demo.html), which supports textarea HTML elements as well.

iMacros get the ID of a div, not the content

I am trying to learn iMacros (and avoid JScript or VBScript if possible). I have been reading every resource I could find since yesterday, and the iMacros reference does not have a helpful example of what I need.
All the methods I tried extract either the TXT or the HTM content of an element. My problem is that I have a div like this:
<div class="cust_div" id="Customer_45621">
...content in here...
</div>
And the part I need to extract is 45621, which is the only dynamic part of the id attribute.
For example, between 3 customers, it could be
Customer_45621
Customer_35123
Customer_85663
All I need is the number. Thanks.
The solution is
TAG POS=1 TYPE=DIV ATTR=CLASS:cust_div EXTRACT=HTM
Then you have to use EVAL, with JavaScript inside it, to pull the id out of the extracted HTML. That is the only way. You can't cut up the HTML code without JS, but you can use JS in iMacros through EVAL.
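The extraction step itself comes down to one regular expression over the extracted HTML. The sketch below shows that pattern in Python purely as an illustration; inside iMacros the equivalent pattern would have to be written in JavaScript within EVAL. The sample string and the function name customer_number are assumptions made for this example, not actual iMacros output.

```python
import re

# Hypothetical sample of the HTML extracted for the div in the question.
extracted_html = '<div class="cust_div" id="Customer_45621">...content in here...</div>'


def customer_number(html_text):
    """Return the digits that follow 'Customer_' in the id attribute, or None."""
    match = re.search(r'id="Customer_(\d+)"', html_text)
    return match.group(1) if match else None


print(customer_number(extracted_html))
```

The capture group (\d+) picks up only the digits, so the same pattern covers Customer_35123 and Customer_85663 as well.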

What is the recommended way to pretty print HTML or code excerpts in AngularDart

What is the recommended way to pretty print HTML or code excerpts in AngularDart? Is there a package to help achieve this (have found none), or do developers simply use "external" packages like google-code-prettify?
I use this http://craig.is/making/rainbows/
You add a javascript tag and some classes to your tags containing your code - that's it.
If you want to include HTML including Angular markup you can use ng-non-bindable to prevent Angular processing tags and attributes it may have selectors for.

Prevent AngularJS interpolation in certain DOM hierarchies

I use AngularJS and have some parts of HTML that I don't wish to interpolate because it contains user inputted data. So potentially the data may have {{asdf}} in there that I don't want AngularJS to parse. This is because if the user inputs {{{}, {}} this may break the compilation process and prevent any Angular code from running.
Is there a way around this by specifying to Angular not to compile this part of the DOM tree?
I believe the ng-non-bindable directive is what you're looking for.
So at any element you can do:
<div ng-non-bindable> Some {{1+2}} expressions</div>
That will display:
Some {{1+2}} expressions

Slim lang inline lists not working

I'm converting some existing HTML files to Slim (https://github.com/stonean/slim) and using it for the first time but I'm having problems getting lists to work in compact form (meaning all on one line rather than indented below). The docs say:
Inline tags
Sometimes you may want to be a little more compact and inline the tags.
ul
  li.first: a href="/a" A link
  li: a href="/b" B link
But when I try that I get this output in the browser:
a href="/b" B
With the rendered HTML looking like this in the source:
<li:>a href="/b" B link</li:>
Any ideas why this isn't working and how to fix it?
Your syntax is correct and the output for me (slim 1.3.0) is, as expected:
<ul><li class="first"><a href="/a">A link</a></li><li><a href="/b">B link</a></li></ul>
You should check your slim version and update appropriately.
