Regex to ignore specific symbols together - html-parsing

Hello, I am working on a problem using regular expressions (RE). The task is to extract information from a made-up HTML document; the title and the content are what I need. This is what I have come up with so far:
<body>([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*([^<]*)(?:<[^>]*+>)*<\/body>
I know it's just repeating the same RE, but I couldn't get it to match otherwise, so please help me there as well.
The title is in <title> </title> and the content is in <body> </body>. But there is a problem: I need to ignore all the \n in the text and extract only the text.
This is some sample text:
<html>\n<head><title>Some title</title></head>\n<body>Here<p> is some </p>content <a href="www.somesite.com">\nclick</body>\n</html>
Also, I know from "RegEx match open tags except XHTML self-contained tags" that I should not parse HTML with RE, but my task requires me to use RE.
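One possible approach, sketched in Ruby since the question does not name a language: capture the inside of <title> and <body> separately, then strip the remaining tags and collapse whitespace in a second step instead of repeating the same group. This assumes the \n in the sample are real newline characters and that the document has no comments, scripts, or ">" inside attribute values. The variable names are illustrative only.

```ruby
# Sample document from the question; "\n" here are real newline characters.
html = "<html>\n<head><title>Some title</title></head>\n" \
       "<body>Here<p> is some </p>content <a href=\"www.somesite.com\">\nclick</body>\n</html>"

# Capture everything between the opening and the closing tag.
# The /m flag lets . match newlines, so elements spanning lines still match.
title = html[/<title[^>]*>(.*?)<\/title>/m, 1]
body  = html[/<body[^>]*>(.*?)<\/body>/m, 1]

# Remove every tag inside the body, then collapse runs of whitespace
# (including newlines) into single spaces.
content = body.gsub(/<[^>]*>/, "").gsub(/\s+/, " ").strip

puts title    # Some title
puts content  # Here is some content click
```

Splitting the work into "find the element" and "clean its contents" avoids having to guess how many tags the body contains.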

Related

Google SDTT appending "#__sid=md3" to URL for mainEntityOfPage

Why is this happening?
HTML shows:
<meta content='http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html' itemprop='mainEntityOfPage' itemscope='itemscope'/>
Structured Data Testing Tool output shows:
http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html#__sid=md3
Update: It looks like it has to do with my breadcrumb list. But still, why is it happening, and is it wrong?
If the URL you want to provide is unique you can use the itemid property.
After the latest update, the tool confronted me with mainEntityOfPage. Following Google's example, I used the following code:
<meta itemscope itemprop="mainEntityOfPage" itemType="https://schema.org/WebPage" itemid="https://blog.hompus.nl/2015/12/04/json-on-a-diet-how-to-shrink-your-dtos-part-2-skip-empty-collections/" />
This shows up correctly in the Structured Data Testing Tool results for my blog.
I don’t know where the fragment #__sid=md3 is coming from, but as the SDTT had some quirks with BreadcrumbList in the past, it might also be a side effect of this.
But note that if you want to provide a URL as value for the mainEntityOfPage property, you must use a link element instead of a meta element:
<link itemprop="mainEntityOfPage" href="http://www.costumingdiary.com/2015/05/freddie-mercury-robe-francaise.html" />
(See examples for Microdata markup that creates an item value, instead of a URL value, for mainEntityOfPage.)

Diamond question mark showing up in HTML (from MySqli)

I have a database which I fill directly in PhpMyAdmin. There are some special characters like é in it. In PhpMyAdmin, they show perfectly. When I convert them to PDF with FPDF/PHP they show perfectly as well. When I want to show them in HTML (with PHP), they become diamond shapes with a question mark in them.
I know this has something to do with charsets and collation, but I'm a total noob on that matter. My database has collation latin1_swedish_ci and I want to keep it that way. My page has <meta http-equiv="content-type" content="text/html; charset=iso-8859-1"> in the header, so they should be the same I think, still, I get question mark diamonds... How do I get the characters I want?
Please don't say I have to convert everything to utf-8. I really want to keep latin1_swedish_ci. I tried charset=utf-8 in the header and converting one of the fields in my database to utf8_general_ci, but that field didn't show up right either, so something else must be wrong...
Thanks to the comments of deceze, I added to the top of my PHP:
<?php
header('Content-Type: text/html; charset=iso-8859-1');
?>
Never knew that the meta-tag http-equiv was only a fallback!
These tags are only fallbacks though which are only used when no HTTP Content-Type header was encountered (the wording "http-equiv" hints at this). It's also conceptually weird, since these tags are inside the document itself and the browser needs to read the document first in order to figure out what kind of document it's dealing with.
Source: Handling Unicode Front To Back In A Web App
This source helped a lot! Thanks deceze!
Put this inside the <head> tag:
<meta charset="iso-8859-1">

iMacros get the ID of a div, not the content

I am trying to learn iMacros (and avoid JScript or VBScript if possible). I have been reading every resource I could find since yesterday, and the iMacros reference does not have any helpful example of what I need.
All the methods I tried extract either the TXT or the HTM content of an element. My problem is that I have a div like this:
<div class="cust_div" id="Customer_45621">
...content in here...
</div>
The part I need to extract is 45621, which is the only dynamic part of the id attribute.
For example, between 3 customers, it could be
Customer_45621
Customer_35123
Customer_85663
All I need is the number. Thanks.
The solution is:
TAG POS=1 TYPE=DIV ATTR=CLASS:cust_div EXTRACT=HTM
Then use EVAL with JavaScript inside it to pull the id out of the extracted HTML. That is the only way: you can't cut up the HTML code without JS, but you can run JS in iMacros with EVAL.

Ruby -- trying to grab <title>this here</title> even if on multiple lines

Currently, I am grabbing titles using the following method:
title = html_response[/<title[^>]*>(.*?)<\/title>/,1]
This does a great job at catching "This is a title" from <title>This is a title</title>. However, there are some web pages that open the title tag on one line, print the title on the next line, and then close the title tag.
The Ruby line I presented above doesn't catch titles such as those, so I'm just trying to find a fix for that.
This famous Stack Overflow post explains why it's a bad idea to use regular expressions to parse HTML. A better approach is to use a gem like Nokogiri to parse out the title tags.
Obligatory "don't use regex with HTML" sentence.
title = html_response[/<title[^>]*>(.*?)<\/title>/m,1]
The m modifier enables Ruby's multiline mode, in which . also matches newline characters, so the capture can span several lines.
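A small check of the difference, using a made-up response string rather than a real page. The captured title keeps the surrounding newlines and indentation, so calling strip afterwards is probably what you want.

```ruby
# Hypothetical response whose title is spread over three lines.
html_response = "<html><head><title>\n  This is a title\n</title></head></html>"

# Without /m, . does not match "\n", so the pattern fails and returns nil.
single_line = html_response[/<title[^>]*>(.*?)<\/title>/, 1]

# With /m, . also matches "\n", so the capture spans the line breaks.
multi_line = html_response[/<title[^>]*>(.*?)<\/title>/m, 1]

# Trim the leading and trailing whitespace that came along with the capture.
title = multi_line.strip

p single_line  # nil
p title        # "This is a title"
```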

Truncating HTML with Liquid

I'm using the Liquid templating engine to display a summarised series of posts - at the moment I have something along these lines:
{% for page in site.posts %}
{{page.content | truncatewords: 100}}
{% endfor %}
The page content contains HTML, and using truncatewords can cause invalid HTML to be inserted in the output. I don't want to remove all of the HTML from the content (embedded videos and images should be visible), and ideally all I want is for the appropriate closing tags to be added.
I can see that merely truncating isn't going to achieve my expected outcome, so my question is: How can I truncate my HTML in order to output valid markup using Liquid?
Update
A very specific problem is that I have a code sample that's marked up using Pygments. Now, if the truncation occurs in the middle of the code sample, it leaves several tags open, messing up the rest of the page. I'm looking for a way to truncate these posts without removing all of the code sample - just to truncate and close all open tags in the content body.
OK, so after not being able to find much in the way of doing this on the web, I cooked up my own solution utilising Nokogiri and a depth-first traversal of the parsed HTML node tree.
TruncateHTML is a simple script that allows a snippet of HTML to be truncated while preserving a valid structure.
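To illustrate the general idea of "truncate, then close whatever is still open", here is a minimal standard-library sketch. It is not the TruncateHTML script and does not use Nokogiri: it swaps the parsed node tree for a simple regex tokenizer and a stack of open tag names, which only holds up for reasonably well-formed markup. The method name truncate_html and the list of void tags are assumptions made for this example.

```ruby
# Hypothetical helper: keep at most max_words words of text, copy tags
# through unchanged, and close any tags still open at the cut-off point.
def truncate_html(html, max_words)
  void_tags = %w[br hr img input meta link]  # elements that never get a closing tag
  out = String.new
  open_tags = []
  words_left = max_words

  # Split the input into tags (<...>) and the text runs between them.
  html.scan(/<[^>]+>|[^<]+/) do |token|
    if token.start_with?("<")
      name = token[/\A<\/?\s*([a-zA-Z0-9]+)/, 1].to_s.downcase
      if name.empty?
        # Comment or doctype: copy it through without touching the stack.
      elsif token.start_with?("</")
        open_tags.pop if open_tags.last == name
      elsif !void_tags.include?(name) && !token.end_with?("/>")
        open_tags.push(name)
      end
      out << token
    else
      words = token.scan(/\s*\S+/)  # each word with its leading whitespace
      if words.length < words_left
        out << token
        words_left -= words.length
      else
        out << words.first(words_left).join
        break  # word budget used up
      end
    end
  end

  # Close the remaining open tags, innermost first.
  out << open_tags.reverse.map { |tag| "</#{tag}>" }.join
end

html = "<div><p>one two <b>three four</b> five</p><p>six</p></div>"
puts truncate_html(html, 3)  # <div><p>one two <b>three</b></p></div>
```

A real parser such as Nokogiri is the more robust choice for arbitrary post content. The sketch only shows why tracking open tags during truncation keeps the output balanced.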

Resources