How to parse a remote website and create a link on every single word for a dictionary tooltip? - ruby-on-rails

I want to parse a random website, modify the content so that every word is a link (for a dictionary tooltip) and then display the website in an iframe.
I'm not looking for a complete solution, but for a hint or a possible strategy. The linking is my problem, parsing the website and displaying it in an iframe is quite simple. So basically I have a String with all the html content. I'm not even sure if it's better to do it serverside or after the page is loaded with JS.
I'm working with Ruby on Rails, jQuery, jRails.
Note: The content of the href attribute depends on the word.
Clarification:
I tried a regexp and it already kind of works:
@site.gsub!(/[A-Za-z]+(?:['-][A-Za-z]+)?|\d+(?:[,.]\d+)?/) {|word| '<a href="...">' + word + '</a>'}
But the problem is to only replace words in the text and leave the HTML as it is. So I guess it is a regex problem...
Thanks for any ideas.

I don't think a regexp is going to work for this - or, at least, it will always be brittle. A better way is to parse the page using Hpricot or Nokogiri, then go through it and modify the nodes that are plain text.

It sounds like you have it mostly planned out already.
Split the content into words and then, for each word, create a link such as <a href="...">whatever</a>.
EDIT (based on your comment):
Ahh ... I recommend you search around for screen scraping techniques. Most of them should start with removing anything between < and > characters, and replacing <br> and <p> with newlines.

I would use Nokogiri to remove the HTML structure before you use the regex.
no_html = Nokogiri::HTML(html_as_string).text

Simple. Hash the HTML, run your regex, then unhash the HTML.
<?php
class ht
{
    static $hashes = array();

    # hashes everything that matches $pattern and saves matches for later unhashing
    static function hash($text, $pattern) {
        # the callback must name the class as a string; a bare `self` is not valid here
        return preg_replace_callback($pattern, array(__CLASS__, 'push'), $text);
    }

    # hashes all html tags and saves them
    static function hash_html($html) {
        return self::hash($html, '`<[^>]+>`');
    }

    # hashes and saves $value, returns key
    static function push($value) {
        if(is_array($value)) $value = $value[0];
        static $i = 0;
        $key = "\x05".++$i."\x06";
        self::$hashes[$key] = $value;
        return $key;
    }

    # unhashes all saved values found in $text
    static function unhash($text) {
        return str_replace(array_keys(self::$hashes), self::$hashes, $text);
    }

    static function get($key) {
        return self::$hashes[$key];
    }

    static function clear() {
        self::$hashes = array();
    }
}
?>
Example usage:
$hashed = ht::hash_html($your_html);
// run your word->href converter on $hashed here, producing $your_formatted_html
echo ht::unhash($your_formatted_html);
Oh... right, I wrote this in PHP. Guess you'll have to convert it to ruby or js, but the idea is the same.
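Since the answer suggests converting the idea to JS, here is a minimal sketch of the same "hash the tags, link the words, unhash" approach. The names linkWords and hrefFor and the /define?word= URL are made up for illustration, and the word pattern is the one from the question. It assumes the input never contains the control characters \x05 and \x06, and it does not treat script or style contents or entities such as &amp; specially, so treat it as a starting point rather than a finished solution.

```javascript
// Sketch: protect tags with placeholders, link the words, then restore the tags.
// linkWords and hrefFor are hypothetical names, not part of any library.
function linkWords(html, hrefFor) {
  const saved = [];

  // 1. Replace each tag with "\x05<index>\x06" and remember the original tag.
  const hashed = html.replace(/<[^>]+>/g, (tag) => {
    saved.push(tag);
    return "\x05" + (saved.length - 1) + "\x06";
  });

  // 2. Link the words. The first alternative matches whole placeholders so
  //    that their digits are passed through instead of being linked.
  const wordOrKey = /\x05\d+\x06|[A-Za-z]+(?:['-][A-Za-z]+)?|\d+(?:[,.]\d+)?/g;
  const linked = hashed.replace(wordOrKey, (m) =>
    m.charCodeAt(0) === 5 ? m : '<a href="' + hrefFor(m) + '">' + m + "</a>"
  );

  // 3. Put the saved tags back.
  return linked.replace(/\x05(\d+)\x06/g, (_, i) => saved[Number(i)]);
}

const out = linkWords(
  '<p class="intro">Hello world</p>',
  (word) => "/define?word=" + encodeURIComponent(word)
);
console.log(out);
```

The attribute value "intro" stays untouched because the whole tag is hidden behind a placeholder while the words are being linked.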

Related

Bookmarks parsing issue

I have a LARGE number of bookmarks and wanted to export them and share them with a group I work with. The issue is that when I export them, there are ADD_DATE and LAST_MODIFIED fields added by the browser (Firefox). I was hoping to just use cut or awk to pull the fields I want but the lack of a space before the >(website_name) is making that difficult. And my regex skills are weak.
How do I add a single space before the second to last > at the end of the line so that I can use cut or awk to pull out the fields I want into a new file?
Ex: 123456">SecurityTrails would become 123456 >SecurityTrails
Please see below for examples of what I'm working with. Any help is greatly appreciated!
<DT>SecurityTrails
I use Firefox myself. It frequently also embeds the favicon into the exported bookmarks.html file via base64 encoding. So, to account for scenarios other than the one the OP mentioned, maybe something like
{mawk/mawk2/gawk} 'BEGIN { FS = "\042" } $1 = $1'
then do whatever cutting that you want. That's just assuming OP wanted to keep every bit of it, and simply remove the quotations.
Now, if the objective is just to take out URL+Name of it,
{mawk/mawk2/gawk} 'BEGIN { DBLQT="\042"; FS = "(<A HREF=" DBLQT "|>)" } /<A HREF=/ {
url = substr($2, 1, index($2, DBLQT) - 1);
sitename = $(NF-1);
sub(/<\/A$/, "", sitename) ;
print url " > " sitename ; }' # or whatever way you want the output to be
I just typed it in extra verbosity to show what \042 meant - the ascii octal for double quote.
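For comparison, the same URL and name extraction can be sketched in JavaScript. This assumes each bookmark sits on one line of the form <DT><A HREF="..." ADD_DATE="..." LAST_MODIFIED="...">Name</A>; the helper name parseBookmarkLine and the sample line (including its URL and timestamps) are made up for illustration.

```javascript
// Hypothetical helper: pull the URL and the title out of one exported bookmark line.
// Returns null for lines that do not contain a bookmark link.
function parseBookmarkLine(line) {
  // HREF value is captured up to the closing quote; any other attributes
  // (ADD_DATE, LAST_MODIFIED, ICON, ...) are skipped by [^>]*.
  const m = /<A\s+HREF="([^"]*)"[^>]*>(.*?)<\/A>/i.exec(line);
  return m ? { url: m[1], name: m[2] } : null;
}

// Made-up sample line in the assumed export format.
const sampleLine =
  '<DT><A HREF="https://example.com/" ADD_DATE="1600000000" LAST_MODIFIED="1600000001">SecurityTrails</A>';

const parsed = parseBookmarkLine(sampleLine);
console.log(parsed.url + " > " + parsed.name);
```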

jQuery Mobile Filtered List - only match beginning of string

I'm using the jQuery Mobile search filter list:
http://jquerymobile.com/test/docs/lists/lists-performance.html
I'm having some performance issues: my list is a little slow to filter on some phones. To try to aid performance, I want to change the search so that only items starting with the search text are returned.
So 'aris' currently finds the result 'paris', but I want this changed. I can see it's possible from the documentation below, but I don't know how to implement the code.
http://jquerymobile.com/test/docs/lists/docs-lists.html
$("document").ready( function (){
$(".ui-listview").listview('option', 'filterCallback', yourFilterFunction)
});
This seems to demonstrate how you write and call your own function, but I have no idea how to write it! Thanks
http://blog.safaribooksonline.com/2012/02/14/jquery-mobile-tip-write-your-own-list-view-filter-function/
UPDATE - I've tried the following in a separate js file:
$("document").ready( function (){
function beginsWith( text, pattern) {
text= text.toLowerCase();
pattern = pattern.toLowerCase();
return pattern == text.substr( 0, pattern.length );
}
$(".ui-listview").listview('option', 'filterCallback', beginsWith)
});
It might look something like this:
function beginsWith( text, pattern) {
text= text.toLowerCase();
pattern = pattern.toLowerCase();
return pattern == text.substr( 0, pattern.length );
}
Basically you compare from 0 to "length" of what you're matching to the source. So if you pass in "test","tester" it will see you're passing in a string of length 4 and then substr "tester" from 0,4, which gives you "test". Then "test" is equal to "test"... so return true. Lowercase them to make it case insensitive.
Another trick to improve filter performance: only filter once they've entered more than 1 character.
Edit: It appears jQuery Mobile's filter function expects "true" to mean the item was not found (and should be hidden), so the comparison needs to be inverted: return pattern != text.substr( 0, pattern.length );
This worked for me. I am using a regular expression here, so it is a slightly different way to achieve the same thing.
But the reason why my code didn't work initially was that the list item text had a lot of spaces at the beginning and at the end (I found while debugging that they got added on their own).
So I do a trim on the text before doing the match. I have a feeling Jonathan Rowny's implementation will also work if we do text.trim() before matching.
$(".ui-listview").listview('option', 'filterCallback', function (text, searchValue) {
var matcher = new RegExp("^" + searchValue, "i");
return !matcher.test(text.trim());
});
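Putting the points from this thread together (trim the item text, compare case-insensitively, and return true when the item should be hidden), a callback could be sketched as below. The name hideUnlessPrefix is made up. It uses a plain string comparison instead of new RegExp, because search text containing regex metacharacters such as "c++" or "(" would otherwise throw or match unexpectedly.

```javascript
// Hypothetical callback name. Returns true when the list item should be HIDDEN,
// which is the convention described in the answers above.
function hideUnlessPrefix(text, searchValue) {
  var item = String(text).trim().toLowerCase();
  var query = String(searchValue).trim().toLowerCase();
  // Compare only the first query.length characters of the item.
  return item.slice(0, query.length) !== query;
}

// In the page (assumes jQuery Mobile's listview widget is loaded):
// $(".ui-listview").listview("option", "filterCallback", hideUnlessPrefix);

console.log(hideUnlessPrefix("  Paris ", "par")); // false: item stays visible
console.log(hideUnlessPrefix("paris", "aris"));   // true: item is hidden
```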

How do you include hashtags within Twitter share link text?

I'm writing a site with a custom tweet button that uses the www.twitter.com/share function, however the problem I am having is including hash '#' characters within the tweet text.
For example:
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+#branstonpickel+right+now
The tweet text comes out as 'I am eating' and omits the hash and everything after.
I had a quick look on the Twitter forums and learnt the hash '#' character cannot be part of the share url. On https://dev.twitter.com/discussions/512#comment-877 it was said that:
Hashes are special characters in the URL (they identify document fragments) so they, and anything following, does not get sent the server.
and
you need to URLEncode it, so use %23
When I tried the 2nd point in my test link:
www.twitter.com/share?url=www.example.com&text=I+am+eating+%23branstonpickel+right+now
The tweet text came out as 'I am eating %23branstonpickel right now' literally including %23 instead of converting it to a hash.
Sorry for the waffly question, but does anyone know what I'm doing wrong?
Any feedback would be greatly appreciated :)
It looks like this is the basic setup:
https://twitter.com/intent/tweet?
url=<url to tweet>
text=<text to tweet>
hashtags=<comma separated list of hashtags, with no # on them>
This would pre-build a tweet of: <text> <url> <hashtags>
The above example would be:
https://twitter.com/intent/tweet?url=http://www.example.com&text=I+am+eating+branston+pickel+right+now&hashtags=bransonpickel,pickles
There used to be a bug with the hashtags parameter... it only showed the first n-1 hashtags. Currently this is fixed.
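As a rough sketch, the intent URL described above can be assembled with the standard URLSearchParams class, which takes care of the encoding (including turning "#" into %23). The helper name buildTweetIntent is made up; the parameter names url, text and hashtags are the ones listed in this answer.

```javascript
// Hypothetical helper that assembles a tweet intent link.
// hashtags is an array of tag names without the leading "#".
function buildTweetIntent(url, text, hashtags) {
  const params = new URLSearchParams({
    url: url,
    text: text,
    hashtags: hashtags.join(","),
  });
  return "https://twitter.com/intent/tweet?" + params.toString();
}

const intentLink = buildTweetIntent(
  "http://www.example.com",
  "I am eating #branstonpickel right now",
  ["pickles"]
);
console.log(intentLink);
```

Because the "#" in the text is percent-encoded, it stays part of the query string instead of being treated as the start of a URL fragment.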
You can use %23 instead of the hash (#) in the URL, e.g.
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+%23branston+%23pickel+right+now
I may be wrong, but I think the hashtag has to be passed as a separate variable that will appear at the end of your tweet, i.e.:
http://www.twitter.com/share?url=www.example.com&text=I+am+eating+branston+pickel+right+now&hashtag=bransonpickel
will result in "I am eating branston pickel right now #branstonpickle"
On a separate note, I think pickel should be pickle!
Cheers
Toby
Use encodeURIComponent to encode the URL.
If you're using PHP, you can use the following:
<?php echo 'http://www.twitter.com/share?' . http_build_query(array(
'url' => 'http://www.example.com',
'text' => 'I am eating #branstonpickel right now'
)); ?>
This will do all the URL encoding for you, and it's easy to read.
For more information on the http_build_query, see the PHP manual:
http://us2.php.net/http_build_query
For a URL with line breaks, @, # and special Unicode characters in it, the following works:
var lineJump = encodeURI(String.fromCharCode(10)),
hash = "%23", arobase="%40",
tweetText = 'https://twitter.com/intent/tweet?text=Le signe chinois '+hans+' '+item.pinyin+': '+item.definition.replace(";",",")+'.'
+lineJump+'Merci '+arobase+'Inalco_Officiel '+arobase+'CRIparis ❤️🇨🇳 '
+lineJump+hash+'Chinois '+hash+'MOOC'
+lineJump+'https://hanzi.cri-paris.org/',
tweetTxtUrlEncoded = tweetText+ "" +encodeURIComponent('#'+lesson+encodeURIComponent(hans));
Use urlencode:
https://twitter.com/intent/tweet?text=<?= urlencode("I am eating #branstonpickel right now"); ?>
You can just use this code and modify it
%20 means space
%23 means hashtag
In JS you can easily encode the special characters using encodeURIComponent.
(Warning: don't use encodeURI as "@" and "#" are not escaped.)
Here's an example with mention and hashtag:
const text = "Hello #world ! Go follow @StackOverflow";
const tweetUrl = `https://twitter.com/intent/tweet?text=${ encodeURIComponent(text) }`;
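The difference between the two functions can be seen with a short comparison; the sample string is made up.

```javascript
// encodeURI leaves characters that have a meaning in a full URL (such as "@" and "#") alone;
// encodeURIComponent escapes them, which is what a query-string value needs.
const sampleText = "Hi @you #tag";

const withEncodeURI = encodeURI(sampleText);
const withComponent = encodeURIComponent(sampleText);

console.log(withEncodeURI); // Hi%20@you%20#tag
console.log(withComponent); // Hi%20%40you%20%23tag
```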

nicedit - is it safe and is it affected by the site's css?

I'm considering using nicedit (http://nicedit.com/) for my site.
I assume that nicedit simply creates simple html using the buttons, and that html gets sent when the user saves it.
Is it recommended? Is someone still working on it?
Assuming I'm later displaying this HTML somewhere on my site, isn't it dangerous, since the user can plant malicious JavaScript? If not, how does nicedit prevent this?
Also, when I display this HTML later, will it be affected by my css? If so, how can I prevent this?
Thanks.
This is what I use. It works like a charm for cleaning out the content of the nicedit instance before chucking it into the database:
function cleanFromEditor($text) {
//try to decode html before we clean it then we submit to database
$text = stripslashes(html_entity_decode($text));
//clean out tags that we don't want in the text
$text = strip_tags($text,'<p><div><strong><em><ul><ol><li><u><blockquote><br><sub><img><a><h1><h2><h3><span><b>');
//conversion elements
$conversion = array(
'<br>'=>'<br />',
'<b>'=>'<strong>',
'</b>'=>'</strong>',
'<i>'=>'<em>',
'</i>'=>'</em>'
);
//clean up the old html with new
foreach($conversion as $old=>$new){
$text = str_replace($old, $new, $text);
}
return htmlentities(mysql_real_escape_string($text));
}
It doesn't appear to be maintained anymore. But I have used it for purposes where I needed just a simple/lightweight WYSIWYG editor. If you are looking for something that gets constant core updates or additional features I wouldn't count on it. I finally broke down and wrote a lot of my own features like tables and YouTube videos.
Yes, a hacker could use it to post a client-side and/or server-side exploit on your site. But this is a threat you face with any editor. You need to filter the input in two ways.
You need to prevent SQL injection by sanitizing your POST variables. I always put this at the beginning of my scripts to clean them, and then use $input['whateveryouarepassing'] instead of $_POST['whateveryouarepassing']. Edit the $mysqli->real_escape_string() parts to work with your database object. Use MySQLi or PDO with prepared statements to harden your code against this attack.
$input = array();
if(isset($_POST)) {
foreach ($_POST as $key => $value) {
if (@get_magic_quotes_gpc()) {
$key = stripslashes($key);
$value = stripslashes($value);
}
$key = $mysqli->real_escape_string($key);
$value = $mysqli->real_escape_string($value);
$input[$key] = $value;
}
}
Then I like to clean it with this function I put together over the years with various methods of cleaning out bad code. Use HTML Purifier instead if you can set it up. If not, here is this bad boy. Call it with cleanHTML($input['whateveryouarepassing']);.
function cleanHTML($string) {
$string = preg_replace('#(&\#*\w+)[\x00-\x20]+;#u', "$1;", $string);
$string = preg_replace('#(&\#x*)([0-9A-F]+);*#iu', "$1$2;", $string);
$string = html_entity_decode($string, ENT_COMPAT, "UTF-8");
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])(on|xmlns)[^>]*>#iUu', "$1>", $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*j[\x00-\x20]*a[\x00-\x20]*v[\x00-\x20]*a[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2nojavascript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*v[\x00-\x20]*b[\x00-\x20]*s[\x00-\x20]*c[\x00-\x20]*r[\x00-\x20]*i[\x00-\x20]*p[\x00-\x20]*t[\x00-\x20]*:#iUu', '$1=$2novbscript...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*-moz-binding[\x00-\x20]*:#Uu', '$1=$2nomozbinding...', $string);
$string = preg_replace('#([a-z]*)[\x00-\x20\/]*=[\x00-\x20\/]*([\`\'\"]*)[\x00-\x20\/]*data[\x00-\x20]*:#Uu', '$1=$2nodata...', $string);
$string = preg_replace('#(<[^>]+[\x00-\x20\"\'\/])style[^>]*>#iUu', "$1>", $string);
$string = preg_replace('#</*\w+:\w[^>]*>#i', "", $string);
$string = preg_replace('/^<\?php(.*)(\?>)?$/s', '$1', $string);
$string = preg_replace('#</*(applet|meta|xml|blink|link|script|embed|object|frame|iframe|frameset|ilayer|layer|bgsound|title|base)[^>]*>#i', "", $string);
return $string;
}
The HTML will be affected by your CSS both while editing and when displayed. You will need to write additional CSS rules if this is an issue. If the problem is during editing, move to an iframe-based editor; to keep your CSS from affecting the displayed content, display the HTML in an iframe.
If you want another suggestion, elRTE is my go-to editor these days. It is a little more advanced but totally worth it once you get to know the code base and API. You will face the same issues as above, as you will with any editor, except for the CSS during editing, since elRTE is frame-based and you can specify stylesheets.
Edit: I posted this assuming you were using PHP. Apologies if not.

Get fragment (value after hash '#') from a URL [closed]

How can I select the fragment after the '#' symbol in my URL using PHP?
The result that I want is "photo45".
This is an example URL:
http://example.com/site/gallery/1#photo45
If you want to get the value after the hash mark or anchor as shown in a user's browser: This isn't possible with "standard" HTTP as this value is never sent to the server (hence it won't be available in $_SERVER["REQUEST_URI"] or similar predefined variables). You would need some sort of JavaScript magic on the client side, e.g. to include this value as a POST parameter.
If it's only about parsing a known URL from whatever source, the answer by mck89 is perfectly fine though.
That part is called "fragment" and you can get it in this way:
$url=parse_url("http://example.com/site/gallery/1#photo45");
echo $url["fragment"]; //This variable contains the fragment
A) already have url with #hash in PHP? Easy! Just parse it out !
if( strpos( $url, "#" ) === false ) echo "NO HASH !";
else echo "HASH IS: #".explode( "#", $url )[1]; // arrays are indexed from 0
Or in "old" PHP you must pre-store the exploded to access the array:
$exploded_url = explode( "#", $url ); $exploded_url[1];
B) You want to get a #hash by sending a form to PHP?     => Use some JavaScript MAGIC! (To pre-process the form)
var forms = document.getElementsByTagName('form'); //get all forms on the site
for (var i = 0; i < forms.length; i++) { //to each form...
forms[i].addEventListener( // add a "listener"
'submit', // for an on-submit "event"
function () { //add a submit pre-processing function:
var input_name = "fragment"; // name form will use to send the fragment
// Try search whether we already done this or not
// in current form, find every <input ... name="fragment" ...>
var hiddens = this.querySelectorAll('[name="' + input_name + '"]'); // "this" is the form being submitted
if (hiddens.length < 1) { // if not there yet
//create an extra input element
var hidden = document.createElement("input");
//set it to hidden so it doesn't break view
hidden.setAttribute('type', 'hidden');
//set a name to get by it in PHP
hidden.setAttribute('name', input_name);
this.appendChild(hidden); //append it to the current form
} else {
var hidden = hiddens[0]; // use an existing one if already there
}
//set a value of #HASH - EVERY TIME, so we get the MOST RECENT #hash :)
hidden.setAttribute('value', window.location.hash);
}
);
}
Depending on your form's method attribute you get this hash in PHP by:
$_GET['fragment'] or $_POST['fragment']
Possible returns: 1. ""[empty string] (no hash) 2. whole hash INCLUDING the #[hash] sign (because we've used the window.location.hash in JavaScript which just works that way :) )
C) You want to get the #hash in PHP JUST from requested URL?
                                    YOU CAN'T !
...(not while considering regular HTTP requests)...
...Hope this helped :)
I've been searching for a workaround for this for a bit - and the only thing I have found is to use URL rewrites to read the "anchor". I found in the apache docs here http://httpd.apache.org/docs/2.2/rewrite/advanced.html the following...
By default, redirecting to an HTML anchor doesn't work, because mod_rewrite escapes the # character, turning it into %23.
This, in turn, breaks the redirection.
Solution: Use the [NE] flag on the RewriteRule. NE stands for No
Escape.
Discussion: This technique will of course also work with other special
characters that mod_rewrite, by default, URL-encodes.
It may have other caveats and what not ... but I think that at least doing something with the # on the server is possible.
You can't get the text after the hash mark. It is not sent to the server in a request.
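To make this concrete, here is a small sketch using the standard URL class with the example URL from the question. The request sent to the server contains only the path (and query string, if any); the fragment exists only on the client side.

```javascript
// The example URL from the question.
const pageUrl = new URL("http://example.com/site/gallery/1#photo45");

// What the browser puts in the HTTP request line: path plus query string.
const requestTarget = pageUrl.pathname + pageUrl.search;

// What stays in the browser: the fragment, here with the leading "#" removed.
const fragment = pageUrl.hash.slice(1);

console.log(requestTarget); // /site/gallery/1
console.log(fragment);      // photo45
```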
I found this trick if you insist on getting the value with PHP:
read the anchor (#) value with JavaScript, store it in a cookie, and then read the cookie value with PHP.
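A minimal sketch of the cookie approach follows. The helper name fragmentCookie and the cookie name "fragment" are made up for illustration. Note that the cookie only reaches PHP on the next request after the script has run, not on the request that loaded the page.

```javascript
// Hypothetical helper: build the cookie string for a given location.hash value.
function fragmentCookie(hash) {
  // Drop the leading "#" if there is one.
  var value = hash.charAt(0) === "#" ? hash.slice(1) : hash;
  return "fragment=" + encodeURIComponent(value) + "; path=/";
}

// In the browser:
//   document.cookie = fragmentCookie(window.location.hash);
// On a later request, PHP can read the value from $_COOKIE['fragment'].

console.log(fragmentCookie("#photo45")); // fragment=photo45; path=/
```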
If you are wanting to dynamically grab the hash from URL, this should work:
https://stackoverflow.com/a/57368072/2062851
<script>
var hash = window.location.hash, //get the hash from url
cleanhash = hash.replace("#", ""); //remove the #
//alert(cleanhash);
</script>
<?php
$hash = "<script>document.writeln(cleanhash);</script>";
echo $hash;
?>
You can do it by a combination of javascript and php:
<div id="cont"></div>
And by the other side;
<script>
var h = window.location.hash;
var h1 = (h.substr(1));//string with no #
var q1 = '<input type="text" id="hash" name="hash" value="'+h1+'">';
setInterval(function(){
if(h1!="")
{
document.querySelector('#cont').innerHTML = q1;
} else alert("Something went wrong")
},1000);
</script>
Then, on form submit you can retrieve the value via $_POST['hash'] (set the form)
You need to parse the url first, so it goes like this:
$url = "https://www.example.com/profile#picture";
$fragment = parse_url($url,PHP_URL_FRAGMENT); //this variable holds the value - 'picture'
Note that this only works for a URL you already have as a string. The URL of the current request, as the server sees it, never includes the fragment, so the following yields null:
$url = $_SERVER["REQUEST_URI"];
$fragment = parse_url($url,PHP_URL_FRAGMENT); // null: the browser does not send the fragment to the server
Getting the data after the hash mark in a URL is simple. Here is an example used when a client accesses a glossary of terms from a book. It takes the name anchor delivered (#tesla), brings the client to that term, and highlights the term and its description in blue so it's easy to see.
Set up your strings with a div id, so the name anchor goes where it's supposed to and the JavaScript can change the text colors:
<div id="tesla">Tesla</div>
<div id="tesla1">An energy company</div>
Use JavaScript to do the heavy work. The script runs in the browser; include it in your PHP page, or wherever.
<script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
I am launching the JavaScript function automatically when the page is loaded.
<script>
$( document ).ready(function() {
    // get the anchor (#tesla) from the URL in the browser
    var myhash1 = $(location).attr('hash'); //myhash1 == #tesla
    // trim the hash sign off of it
    myhash1 = myhash1.substr(1); //myhash1 == tesla
    // I need to highlight the term and the description, so I create a new var
    var myhash2 = '1';
    myhash2 = myhash1.concat(myhash2); //myhash2 == tesla1
    // Now I can manipulate the text color for the term and description
    var elem = document.getElementById(myhash1);
    elem.style.color = 'blue';
    elem = document.getElementById(myhash2);
    elem.style.color = 'blue';
});
</script>
This works. The client clicks a link on the client side (example.com#tesla) and goes right to the term. The term and the description are highlighted in blue by JavaScript for quick reading; all other entries are left in black.

Resources