Replacing caption between <a> with attributes - hyperlink

I'm trying to preg_replace a link caption as below. Can't find an example tho where replacing would consider tag attributes, not just clean tags
Basically, this
Database Title
needs to become this
My Own Title
Help appreciated

you can a use a regular expression along with back references
$html = 'Database Title;
$html = preg_replace("/(<.+>).+(<.+>)/", "$1My Own Title$2", $html);
echo $html;
http://www.php.net/manual/en/regexp.reference.back-references.php

Related

HTMLPurifier removing <a name="#someanchorname"></a> - how to stop this from happening?

Using htmlpurifier 4.10, anchor name tags are being stripped out of text.
Current config:
$class_file = 'static/htmlpurifier-4.10.0-lite/library/HTMLPurifier.auto.php';
$class_html_cleaner = 'HTMLPurifier';
require_once($class_file);
// Initiate config
$config = HTMLPurifier_Config::createDefault();
$config->set('AutoFormat.AutoParagraph', FALSE);
$config->set('AutoFormat.RemoveEmpty', TRUE);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', TRUE);
// initiate class
$purifier = new HTMLPurifier($config);
// clean passed HTML
$html = $purifier->purify($html);
Adding the config HTML.Allowed:
$config->set('AutoFormat.AutoParagraph', FALSE);
$config->set('AutoFormat.RemoveEmpty', TRUE);
$config->set('AutoFormat.RemoveEmpty.RemoveNbsp', TRUE);
$config->set('HTML.Allowed', 'a[href|target|name|id|class]');
Does nothing, the name tags are still removed.
Removing three AutoFormat options so I just have this:
$config->set('HTML.Allowed', 'a[href|target|name|id|class]');
Also strips the name attribute, but at least now the name tag I posted is returned as <a></a>.
What else am I missing here? I'd rather not use HTML.Allowed if it means I have to explicitly state every other potential tag/attribute I would ever use.
Guidance/help greatly appreciated. Been fighting with this for an hour now.
The Attr.EnableID rule removes html id attributes by default. (And it looks like name attributes as well.)
http://htmlpurifier.org/live/configdoc/plain.html#HTML.EnableAttrID
Why it happens is explained here, http://htmlpurifier.org/docs/enduser-id.html.

XML Parsing - node.text method removing trailling spaces

I have a big xml file which i'm parsing using jscript. I have used the following code to load the xml
var xmlDoc = Sys.OleObject("Msxml2.DOMDocument.6.0");
xmlDoc.async = false;
// Load xml data from a file
xmlDoc.load(this._studyDocPath);
Now if i use the following code
var text = this.xmlDoc.selectSingleNode(xPath);
text = node.text;
the text variable holds the innertext of a perticular tag. But if I have tag like this
<Text>ABCD </Text>
then the node.text returns me only the value 'ABCD' i.e. it automatically trims the space. But I dont need to trim any trailling spaces. I need the text as it is. How can I achieve that?
Looking forward to your response
Thanks in Advance
We can use node.firstChild.nodeValue with a null check on node.firstChild

How do you include hashtags within Twitter share link text?

I'm writing a site with a custom tweet button that uses the www.twitter.com/share function, however the problem I am having is including hash '#' characters within the tweet text.
For example:
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+#branstonpickel+right+now
The tweet text comes out as 'I am eating' and omits the hash and everything after.
I had a quick look on the Twitter forums and learnt the hash '#' character cannot be part of the share url. On https://dev.twitter.com/discussions/512#comment-877 it was said that:
Hashes are special characters in the URL (they identify document fragments) so they, and anything following, does not get sent the server.
and
you need to URLEncode it, so use %23
When I tried the 2nd point in my test link:
www.twitter.com/share?url=www.example.com&text=I+am+eating+%23branstonpickel+right+now
The tweet text came out as 'I am eating %23branstonpickel right now' literally including %23 instead of converting it to a hash.
Sorry for the waffely question, but does anyone know what it is I'm doing wrong?
Any feedback would be greatly appreciated :)
It looks like this is the basic setup:
https://twitter.com/intent/tweet?
url=<url to tweet>
text=<text to tweet>
hashtags=<comma separated list of hashtags, with no # on them>
This would pre-built a tweet of: <text> <url> <hashtags>
The above example would be:
https://twitter.com/intent/tweet?url=http://www.example.com&text=I+am+eating+branston+pickel+right+now&hashtags=bransonpickel,pickles
There used to be a bug with the hashtags parameter... it only showed the first n-1 hashtags. Currently this is fixed.
you can use %23 instead of hash (#) in url eg
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+%23branston+%23pickel+right+now
I may be wrong but i think the hashtag has to be passed as a separate variable that will appear at the end of your tweet ie:
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+branston+pickel+right+now&hashtag=bransonpickel
will result in "I am eating branston pickel right now #branstonpickle"
On a separate note, I think pickel should be pickle!
Cheers
Toby
use encodeURIComponent to encode the url
If you're using PHP, you can use the following:
<?php echo 'http://www.twitter.com/share?' . http_build_query(array(
'url' => 'http://www.example.com',
'text' => 'I am eating #branstonpickel right now'
)); ?>
This will do all the URL encoding for you, and it's easy to read.
For more information on the http_build_query, see the PHP manual:
http://us2.php.net/http_build_query
For url with line jump, # , # and special unicode in it, the following works :
var lineJump = encodeURI(String.fromCharCode(10)),
hash = "%23", arobase="%40",
tweetText = 'https://twitter.com/intent/tweet?text=Le signe chinois '+hans+' '+item.pinyin+': '+item.definition.replace(";",",")+'.'
+lineJump+'Merci '+arobase+'Inalco_Officiel '+arobase+'CRIparis ❤️🇨🇳 '
+lineJump+hash+'Chinois '+hash+'MOOC'
+lineJump+'https://hanzi.cri-paris.org/',
tweetTxtUrlEncoded = tweetText+ "" +encodeURIComponent('#'+lesson+encodeURIComponent(hans));
urlencode
https://twitter.com/intent/tweet?text=<?= urlencode("I am eating #branstonpickel right now"); ?>"
You can just use this code and modify it
20% means space
23% means hashtag
In JS you can easily encode the special characters using encoreURIComponent.
(Warning: don't use encodeURI as "#" and "#" are not escaped.)
Here's an example with mention and hashtag:
const text = "Hello #world ! Go follow #StackOverflow";
const tweetUrl = `https://twitter.com/intent/tweet?text=${ encodeURIComponent(text) }`;

nicedit - is it safe and is it affected by the site's css?

I'm considering using nicedit (http://nicedit.com/) for my site.
I assume that nicedit simply creates simple html using the buttons, and that html gets sent when the user saves it.
Is it recommended? Is someone still working on it?
Assuming I'm later displaying this HTML in my site somewhere, isn't it dangerous due to the user being able to plant malicious javascript? If not, how does nicedit prevents this?
Also, when I display this HTML later, will it be affected by my css? If so, how can I prevent this?
Thanks.
This is what I use it works like a charm for cleaning out the content of the nicedit instance before chucking into the database
function cleanFromEditor($text) {
//try to decode html before we clean it then we submit to database
$text = stripslashes(html_entity_decode($text));
//clean out tags that we don't want in the text
$text = strip_tags($text,'<p><div><strong><em><ul><ol><li><u><blockquote><br><sub><img><a><h1><h2><h3><span><b>');
//conversion elements
$conversion = array(
'<br>'=>'<br />',
'<b>'=>'<strong>',
'</b>'=>'</strong>',
'<i>'=>'<em>',
'</i>'=>'</em>'
);
//clean up the old html with new
foreach($conversion as $old=>$new){
$text = str_replace($old, $new, $text);
}
return htmlentities(mysql_real_escape_string($text));
}
It doesn't appear to be maintained anymore. But I have used it for purposes where I needed just a simple/lightweight WYSIWYG editor. If you are looking for something that gets constant core updates or additional features I wouldn't count on it. I finally broke down and wrote a lot of my own features like tables and YouTube videos.
Yes, a hacker could use it to post an client and/or server exploit on your site. But this is a threat you can face with any editor. You need to filter the code for two methods.
You need to prevent SQL injection by sanitizing your post variables. I always put this at the beginning of my scripts to clean them and call them with $input['whateveryouarepassing']instead of $_POST['whateveryouarepassing']. Edit the $mysqli->real_escape_string() parts to work with your database object. Use MySQLi or PDO with prepared statements to help harden the attack.
$input = array();
if(isset($_POST)) {
foreach ($_POST as $key => $value) {
if (#get_magic_quotes_gpc()) {
$key = stripslashes($key);
$value = stripslashes($value);
}
$key = $mysqli->real_escape_string($key);
$value = $mysqli->real_escape_string($value);
$input[$key] = $value;
}
}
Then I like to clean it with this function I put together over the years with various methods of cleaning out bad code. Use HTML Purifier instead if you can set it up. If not, here is this bad boy. Call it with cleanHTML($input['whateveryouarepassing']);.
function cleanHTML($string) {
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#u', "$1;", $string);
$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string);
$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('/^<\?php(.*)(\?>)?$/s', '$1', $string);
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|embed|object|frame|iframe|frameset|ilayer|layer|bgsound|title|base)[^>]*>#i', "", $string);
return $string;
}
The HTML will be affected by your CSS when editing and displayed. You will need code additional CSS rules if this is an issue. If the issue is when editing move to a iframe based editor and to prevent the css display the html content in an iframe.
If you want another suggestion elRTE is my goto editor these days. A little more advanced but totally worth it once you get to know the code base and API. You will face the same issues as above as will any editor. Except the CSS during editing since elRTE is framebased and you can specify stylesheets. elRTE Homepage
Edit: I posted this assuming you were using PHP. Apologies if not.

How to parse a remote website and create a link on every single word for a dictionary tooltip?

I want to parse a random website, modify the content so that every word is a link (for a dictionary tooltip) and then display the website in an iframe.
I'm not looking for a complete solution, but for a hint or a possible strategy. The linking is my problem, parsing the website and displaying it in an iframe is quite simple. So basically I have a String with all the html content. I'm not even sure if it's better to do it serverside or after the page is loaded with JS.
I'm working with Ruby on Rails, jQuery, jRails.
Note: The content of the href tag depends on the word.
Clarification:
I tried a regexp and it already kind of works:
#site.gsub!(/[A-Za-z]+(?:['-][A-Za-z]+)?|\\d+(?:[,.]\\d+)?/) {|word| '' + word + ''}
But the problem is to only replace words in the text and leave the HTML as it is. So I guess it is a regex problem...
Thanks for any ideas.
I don't think a regexp is going to work for this - or, at least, it will always be brittle. A better way is to parse the page using Hpricot or Nokogiri, then go through it and modify the nodes that are plain text.
It sounds like you have it mostly planned out already.
Split the content into words and then for each word, create a link, such as whatever
EDIT (based on your comment):
Ahh ... I recommend you search around for screen scraping techniques. Most of them should start with removing anything between < and > characters, and replacing <br> and <p> with newlines.
I would use Nokogiri to remove the HTML structure before you use the regex.
no_html = Nokogiri::HTML(html_as_string).text
Simple. Hash the HTML, run your regex, then unhash the HTML.
<?php
class ht
{
static $hashes = array();
# hashes everything that matches $pattern and saves matches for later unhashing
function hash($text, $pattern) {
return preg_replace_callback($pattern, array(self,'push'), $text);
}
# hashes all html tags and saves them
function hash_html($html) {
return self::hash($html, '`<[^>]+>`');
}
# hashes and saves $value, returns key
function push($value) {
if(is_array($value)) $value = $value[0];
static $i = 0;
$key = "\x05".++$i."\x06";
self::$hashes[$key] = $value;
return $key;
}
# unhashes all saved values found in $text
function unhash($text) {
return str_replace(array_keys(self::$hashes), self::$hashes, $text);
}
function get($key) {
return self::$hashes[$key];
}
function clear() {
self::$hashes = array();
}
}
?>
Example usage:
ht::hash_html($your_html);
// your word->href converter here
ht::unhash($your_formatted_html);
Oh... right, I wrote this in PHP. Guess you'll have to convert it to ruby or js, but the idea is the same.

Resources