How to parse Markdown in PHP? - parsing

First, I know, there already is a Markdown parser for PHP.
I also took a look to this question but it doesn't answer to my question.
Obviously, even if the title mention PHP, if it's language agnostic, because I'd like to know what are the step I've to go through to do that.
I've read about PEG, but I've to admit, I didn't really understand the example provided with the PHP parser.
I've also read about CFG.
I've found Zend_Markup_Parser_Textile which seems to construct a so called "Token Tree" (what's about it?) but it currently unusable. (Btw, Textile is not Markdown)
So, concretely, how would you go to this?
Obviously I though about using Regex but, I'm afraid.
Because Markdown supports several syntaxes for the same element (Setext and atx).
Could you give some starting point?

You should have a look at Parsedown.
It parses Markdown text the way people do. First, it divides texts into lines. Then it looks at how these lines start and relate to each other. Finally, it looks for special characters to identify inline elements.

There is PHP Markdown Extra that seems to be popular, you could start by looking at its source.

Also, there is an object-oriented implementation of Markdown which is faster: markdown-oo-php

Ciconia - A New Markdown Parser for PHP is a good one I found.
You just need 3 things to do :
1.Install Ciconia and parse file according to the document.
2. Add corresponding css theme to make it nice, like github markdown style or here.
3. Add syntax highlighting javascript, like google Javascript code prettifier.
Then everything will look pretty good.
If you want a complete example, here is my working demo for github style markdown:
<?php
header("Content-Type: text/html;charset=utf-8");
require 'vendor/autoload.php';
use Ciconia\Ciconia;
use Ciconia\Extension\Gfm;
$ciconia = new Ciconia();
$ciconia->addExtension(new Gfm\FencedCodeBlockExtension());
$ciconia->addExtension(new Gfm\TaskListExtension());
$ciconia->addExtension(new Gfm\InlineStyleExtension());
$ciconia->addExtension(new Gfm\WhiteSpaceExtension());
$ciconia->addExtension(new Gfm\TableExtension());
$ciconia->addExtension(new Gfm\UrlAutoLinkExtension());
$contents = file_get_contents('Readme.md');
$html = $ciconia->render($contents);
?>
<!DOCTYPE html>
<html>
<head>
<title>Excel to Lua table - Readme</title>
<script src="https://cdn.rawgit.com/google/code-prettify/master/loader/run_prettify.js"></script>
<link rel="stylesheet" href="./github-markdown.css">
<style>
.markdown-body {
box-sizing: border-box;
min-width: 200px;
max-width: 980px;
margin: 0 auto;
padding: 45px;
}
</style>
</head>
<body>
<article class="markdown-body">
<?php
# Put HTML content in the document
echo $html;
?>
</article>
</body>
</html>

Using regexes.
<?php
/**
* Slimdown - A very basic regex-based Markdown parser. Supports the
* following elements (and can be extended via Slimdown::add_rule()):
*
* - Headers
* - Links
* - Bold
* - Emphasis
* - Deletions
* - Quotes
* - Inline code
* - Blockquotes
* - Ordered/unordered lists
* - Horizontal rules
*
* Author: Johnny Broadway <johnny#johnnybroadway.com>
* Website: https://gist.github.com/jbroadway/2836900
* License: MIT
*/
class Slimdown {
public static $rules = array (
'/(#+)(.*)/' => 'self::header', // headers
'/\[([^\[]+)\]\(([^\)]+)\)/' => '<a href=\'\2\'>\1</a>', // links
'/(\*\*|__)(.*?)\1/' => '<strong>\2</strong>', // bold
'/(\*|_)(.*?)\1/' => '<em>\2</em>', // emphasis
'/\~\~(.*?)\~\~/' => '<del>\1</del>', // del
'/\:\"(.*?)\"\:/' => '<q>\1</q>', // quote
'/`(.*?)`/' => '<code>\1</code>', // inline code
'/\n\*(.*)/' => 'self::ul_list', // ul lists
'/\n[0-9]+\.(.*)/' => 'self::ol_list', // ol lists
'/\n(>|\>)(.*)/' => 'self::blockquote ', // blockquotes
'/\n-{5,}/' => "\n<hr />", // horizontal rule
'/\n([^\n]+)\n/' => 'self::para', // add paragraphs
'/<\/ul>\s?<ul>/' => '', // fix extra ul
'/<\/ol>\s?<ol>/' => '', // fix extra ol
'/<\/blockquote><blockquote>/' => "\n" // fix extra blockquote
);
private static function para ($regs) {
$line = $regs[1];
$trimmed = trim ($line);
if (preg_match ('/^<\/?(ul|ol|li|h|p|bl)/', $trimmed)) {
return "\n" . $line . "\n";
}
return sprintf ("\n<p>%s</p>\n", $trimmed);
}
private static function ul_list ($regs) {
$item = $regs[1];
return sprintf ("\n<ul>\n\t<li>%s</li>\n</ul>", trim ($item));
}
private static function ol_list ($regs) {
$item = $regs[1];
return sprintf ("\n<ol>\n\t<li>%s</li>\n</ol>", trim ($item));
}
private static function blockquote ($regs) {
$item = $regs[2];
return sprintf ("\n<blockquote>%s</blockquote>", trim ($item));
}
private static function header ($regs) {
list ($tmp, $chars, $header) = $regs;
$level = strlen ($chars);
return sprintf ('<h%d>%s</h%d>', $level, trim ($header), $level);
}
/**
* Add a rule.
*/
public static function add_rule ($regex, $replacement) {
self::$rules[$regex] = $replacement;
}
/**
* Render some Markdown into HTML.
*/
public static function render ($text) {
$text = "\n" . $text . "\n";
foreach (self::$rules as $regex => $replacement) {
if (is_callable ( $replacement)) {
$text = preg_replace_callback ($regex, $replacement, $text);
} else {
$text = preg_replace ($regex, $replacement, $text);
}
}
return trim ($text);
}
}
echo Slimdown::render ("# Title
And *now* [a link](http://www.google.com) to **follow** and [another](http://yahoo.com/).
* One
* Two
* Three
## Subhead
One **two** three **four** five.
One __two__ three _four_ five __six__ seven _eight_.
1. One
2. Two
3. Three
More text with `inline($code)` sample.
> A block quote
> across two lines.
More text...");
Origin https://gist.github.com/jbroadway/2836900

Related

Swift - Split text based on arabic combined characters

Dears,
I have arabic sentence like this stentence
أكل الولد التفاحة
how can i split the sentence based on UNCONNECTED characters to be like this :
أ-
كل
ا-
لو-
لد
ا-
لتفا-
حة
I put - to explain what i mean.
I just need to split the text into array based on that
How can i do that using swift code for ios ?
Update:
I dont care for the spaces.
"أكل" for example is one word and doesn't contain spaces.I want to split based on UNCONNECTED characters.
So "أكل" consist from two objects : "أ" and "كل"
الولد : three objects "ا" and "لو" and "لد"
Use the below code:
let a = "أكل الولد التفاحة".split(separator: " ")
You can replace spaces with "-" using replacing occurences function.
let text = "أكل الولد التفاحة".replacingOccurrences(of: " ", with: "-", options: NSString.CompareOptions.literal, range: nil) ?? ""
I don't know how accepted answer helps to fix the issue.
Apple already provided Natural Language Framework to handle such a things which more trustworthy
When you work with natural language text, it’s often useful to tokenize the text into individual words. Using NLTokenizer to enumerate words, rather than simply splitting components by whitespace, ensures correct behavior in multiple scripts and languages. For example, neither Chinese nor Japanese uses spaces to delimit words.
Here is example
let text = """
All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
print(text[tokenRange])
return true
}
Here is link of Apple docs
Hope it is helpful
There is two box you can just click in first. Content automatically paste click convert. Output data automatically copied with spaces I used for this quran
<h1>Allah</h1>
<center>
<textarea id="field" onclick="paste(this)" style="font-size: xxx-large;min-width: 90%; min-height: 200px;"> </textarea>
<center>
</center>
</br>
<textarea id="field2" style="font-size: xxx-large;min-width: 95%; min-height: 200px;"> </textarea>
</center>
<center>
<br>
<button onclick="myFunction()" style="font-size: xx-large;min-width: 20%;">Convert</button>
</center>
<script >
function myFunction(){
var string = document.getElementById("field").value;
// Option 1
string.split('');
// Option 2
console.log(string);
// Option 3
Array.from(string);
// Option 4
var bb = Object.assign([], string);
console.log(bb);
cleanArray = bb.filter(function () { return true });
var filtered = bb.filter(function (el) {
return el != null; });
console.log(filtered);
var bb = bb.toString();
console.log(bb);
bb = bb.replace(",","");
var stringWithoutCommas = bb.replace(/,/g, ' ');
console.log(stringWithoutCommas);
document.execCommand(stringWithoutCommas)
document.getElementById("field2").value = stringWithoutCommas;
var copyTextarea = document.querySelector('#field2');
copyTextarea.focus();
copyTextarea.select();
try {
var successful = document.execCommand('copy');
var msg = successful ? 'successful' : 'unsuccessful';
console.log('Copying text command was ' + msg);
} catch (err) {
console.log('Oops, unable to copy');
}
};
/*
var copyTextareaBtn = document.querySelector('#newr');
copyTextareaBtn.addEventListener('click', function(event) {
var copyTextarea = document.querySelector('#field2');
copyTextarea.focus();
copyTextarea.select();
try {
var successful = document.execCommand('copy');
var msg = successful ? 'successful' : 'unsuccessful';
console.log('Copying text command was ' + msg);
} catch (err) {
console.log('Oops, unable to copy');
}
});
*/
async function paste(input) {
document.getElementById("field2").value = "";
const text = await navigator.clipboard.readText();
input.value = text;
}
</script>
Try this:
"أكل الولد التفاحة".map {String($0)}

Get code between tags and generate youtube embed

I have some tekst and in the middle of article I put {youtube}IPtv14q9ZDg{/youtube}. How to make code which is between {youtube} generated in youtube embed
I use PHP, but there might be other options.
I filter the text for the key-words. Then take the 11 digit code and wrap it in a link tag. Works best in a "for loop".
This is one I use to find url's in my text and make them live. But you can modify it to do what you want by changing the "preg_match" setting.
function make_clickable($string) {
$string = preg_replace("/[\n\r]/"," <br /> ",$string);
$arr = explode(' ', $string);
foreach($arr as $key => $value){
if(preg_match('#((^https?|http|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i', $value)){
$arr[$key] = "<a class=\"custome\" href='". $value ."' target=\"_blank\" class='link'>$value</a> ";
}
}
$string = implode(' ', $arr);
return $string;
}

How can I add page numbers to PDFKit generated PDFs?

I have multiple pages generated using PDFKit. How can I add page numbers to the bottom?
PDFKit.configure do |config|
config.default_options = {
header_right: "Page [page] of [toPage]"
}
end
kit = PDFKit.new(body_html)
Read all detailed documentation here:
http://madalgo.au.dk/~jakobt/wkhtmltoxdoc/wkhtmltopdf-0.9.9-doc.html
PDFKit is just a wrap up for wkhtmltopdf application that is written in C.
you need to specify a footer like this:
kit = PDFKit.new(your_html_content_for_pdf, :footer_html => "#{path_to_a_footer_html_file}")
then in the footer file have this:
<html>
<head>
<script type="text/javascript">
function subst() {
var vars={};
var x=document.location.search.substring(1).split('&');
for(var i in x) {var z=x[i].split('=',2);vars[z[0]] = unescape(z[1]);}
var x=['frompage','topage','page','webpage','section','subsection','subsubsection'];
for(var i in x) {
var y = document.getElementsByClassName(x[i]);
for(var j=0; j<y.length; ++j) y[j].textContent = vars[x[i]];
}
}
</script>
</head>
<body style="margin: 0;" onload="subst();">
Page <span class="page"></span> of <span class="topage"></span>
</body>
</html>
elements of classes 'frompage','topage','page','webpage','section','subsection','subsubsection' will get substituted with the appropriate data
I did page number with PDFKit, just by adding this:
%meta{:name => 'pdfkit-footer_right', :content => "[page]"}
in my haml file, in my RoR project.
For some weird reason, ( perhaps because I'm using slim ) - I have to use single quotes around the content, instead of double quotes - or else it attempts to escape the brackets and raw text "[page]" shows up, so try single quotes if you run into this issue with your pages.

Open Source Projects for i18n à la Facebook

Facebook has this unique and clever approach to localization of their site: translators (in their case users that help to translate the site voluntarily) can simply click on the not-yet-translated strings – which are marked with a green bottom border – in their natural context on the site. See http://www.facebook.com/translations/.
Now, if you ever had to deal with the translation of a website, you'll be well aware of how odd and funny some of these translations can be when using tools like poedit where the translator isn't fully aware of the spot the translated string will lated appear in on the website.
Example: Please translate "Home". In German, for instance, the start page of a website would be "Home" while the house you live in is "Heim". Now, you as the translator basically have to guess which context this term is likely to appear in on the website and translate accordingly. Chances are, you're new website on home furniture now translates as "Home-Einrichtung" which sounds ridiculous to any German.
So, my question boils down to:
Do you know any open source PHP projects that work on something like this? I'm basically looking for a framework that allows you to put your internationalized website in "translation mode" and make strings clickable and translatable e.g. through a Javascript modal.
I'm not so much looking for a full-fledged and ready-made solution, but would love to know about similar projects that I can contribute code to.
Thanks in advance!
If you want to roll your own with jquery & jquery browserLanguage, this might get you going.
Tag all translatable text's contain elements with class="i18n", and include jquery, jquery browserLanguage, and your i18n script.
1. the internationalization javascript
— this needs to accept translations via ajax from your server, like:
var i18n = {};
i18n.bank = new Array();
i18n.t = function ( text, tl=$.browserLanguage ) {
var r = false;
$.ajax({
url: "/i18n_t.php?type=request&from="+ escape(text) +"&tl="+ tl,
success: function(){ i18n.bank[text] = this; r = true; }
});
return r;
};
2. php i18n translation service
— now we need to serve up translations, and accept them
the database will look like a bunch of tables, one for each language.
// SCHEMA for each language:
CREATE TABLE `en`
(
`id` INT PRIMARY KEY AUTO INCREMENT NOT NULL,
`from` VARCHAR(500) NOT NULL,
`to` VARCHAR(500) NOT NULL
)
the php will need some connection and db manipulation.. for now this may do:
//Connect to the database
$connection = mysql_connect('host (usually localhost)', 'mysql_username' , 'mysql_password');
$selection = mysql_select_db('mysql_database', $connection);
function table_exists($tablename, $database = false) {
if(!$database) {
$res = mysql_query("SELECT DATABASE()");
$database = mysql_result($res, 0);
}
$res = mysql_query("SELECT COUNT(*) AS count FROM information_schema.tables WHERE table_schema = '$database' AND table_name = '$tablename'
");
return mysql_result($res, 0) == 1;
}
the code is simply:
<?php
// .. database stuff from above goes here ..
$type=$_GET["type"];
$from=$_GET["from"];
$to=$_GET["to"];
$tl=$_GET["tl"];
if (! table_exists($tl)) {
...
}
if ($type == "request") { // might want to set $tl="en" when ! table_exists($tl)
$find = mysql_query("SELECT to FROM `'$tl'` WHERE from='$from'");
$row = mysql_fetch_array($find);
echo $row['to'];
} elsif ($type == "suggest") {
$find = mysql_query("SELECT COUNT(*) AS count FROM `'$tl'` WHERE from='$from'");
if ( !(mysql_result($res, 0)) == 0 ) {
$ins = mysql_query("INSERT INTO `'$tl'` (from, to) VALUES ('$from','$to')");
}
}
?>
3. page translation mechanics
— finally we can tie them together in your webpages with some further jquery:
i18n.suggest = function (from) { // post user translation to our php
$.ajax({
url: "/i18n_t.php?type=suggest&from='+from+'&to="+ escape( $('#i18n_s').contents() ) +"&tl="+ $.browserLanguage,
success: function(){ $('#i18n_t_div').html('<em>Thanks!</em>').delay(334).fadeOut().remove(); }
});
};
$(document).ready(function() {
i18n.t("submit");
i18n.t("Thanks!");
$('.i18n').click( function(event) { //add an onClick event for all i18n spans
$('#i18n_t_div').remove;
$(this).parent().append(
'<div id="i18n_t_div"><form class="i18n_t_form">
<input type="text" id="i18n_s" name="suggestion" value="+$(this).contents()+" />
<input type="button" value="'+ i18n.bank[ "submit" ] +'" onclick="i18n.suggest( '+$(this).contents()+' )" />
</form></div>'
);
}).each(function(){
var c = $(this).contents(); //now load initial translations for browser language for all the internationalized content on the page
if ( i18n.t(c) ){
$(this).html(i18n.bank[c]);
}
});
});
Mind you I don't have a server to test this on... and I don't actually code php. :D It will take some debugging but the scaffolding should be correct.

How to parse template languages in Ragel?

I've been working on a parser for simple template language. I'm using Ragel.
The requirements are modest. I'm trying to find [[tags]] that can be embedded anywhere in the input string.
I'm trying to parse a simple template language, something that can have tags such as {{foo}} embedded within HTML. I tried several approaches to parse this but had to resort to using a Ragel scanner and use the inefficient approach of only matching a single character as a "catch all". I feel this is the wrong way to go about this. I'm essentially abusing the longest-match bias of the scanner to implement my default rule ( it can only be 1 char long, so it should always be the last resort ).
%%{
machine parser;
action start { tokstart = p; }
action on_tag { results << [:tag, data[tokstart..p]] }
action on_static { results << [:static, data[p..p]] }
tag = ('[[' lower+ ']]') >start #on_tag;
main := |*
tag;
any => on_static;
*|;
}%%
( actions written in ruby, but should be easy to understand ).
How would you go about writing a parser for such a simple language? Is Ragel maybe not the right tool? It seems you have to fight Ragel tooth and nails if the syntax is unpredictable such as this.
Ragel works fine. You just need to be careful about what you're matching. Your question uses both [[tag]] and {{tag}}, but your example uses [[tag]], so I figure that's what you're trying to treat as special.
What you want to do is eat text until you hit an open-bracket. If that bracket is followed by another bracket, then it's time to start eating lowercase characters till you hit a close-bracket. Since the text in the tag cannot include any bracket, you know that the only non-error character that can follow that close-bracket is another close-bracket. At that point, you're back where you started.
Well, that's a verbatim description of this machine:
tag = '[[' lower+ ']]';
main := (
(any - '[')* # eat text
('[' ^'[' | tag) # try to eat a tag
)*;
The tricky part is, where do you call your actions? I don't claim to have the best answer to that, but here's what I came up with:
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
There are a few non-obvious things:
The eof action is needed because %PrintTextNode is only ever invoked on leaving a machine. If the input ends with normal text, there will be no input to make it leave that state. Because it will also be called when the input ends with a tag, and there is no final, unprinted text node, PrintTextNode tests that it has some text to print.
The %PrintTextNode action nestled in after the ^'[' is needed because, though we marked the start when we hit the [, after we hit a non-[, we'll start trying to parse anything again and remark the start point. We need to flush those two characters before that happens, hence that action invocation.
The full parser follows. I did it in C because that's what I know, but you should be able to turn it into whatever language you need pretty readily:
/* ragel so_tag.rl && gcc so_tag.c -o so_tag */
#include <stdio.h>
#include <string.h>
static char *text_start;
%%{
machine parser;
action MarkStart { text_start = fpc; }
action PrintTextNode {
int text_len = fpc - text_start;
if (text_len > 0) {
printf("TEXT(%.*s)\n", text_len, text_start);
}
}
action PrintTagNode {
int text_len = fpc - text_start - 1; /* drop closing bracket */
printf("TAG(%.*s)\n", text_len, text_start);
}
tag = '[[' (lower+ >MarkStart) ']]' #PrintTagNode;
main := (
(any - '[')* >MarkStart %PrintTextNode
('[' ^'[' %PrintTextNode | tag) >MarkStart
)* #eof(PrintTextNode);
}%%
%% write data;
int
main(void) {
char buffer[4096];
int cs;
char *p = NULL;
char *pe = NULL;
char *eof = NULL;
%% write init;
do {
size_t nread = fread(buffer, 1, sizeof(buffer), stdin);
p = buffer;
pe = p + nread;
if (nread < sizeof(buffer) && feof(stdin)) eof = pe;
%% write exec;
if (eof || cs == %%{ write error; }%%) break;
} while (1);
return 0;
}
Here's some test input:
[[header]]
<html>
<head><title>title</title></head>
<body>
<h1>[[headertext]]</h1>
<p>I am feeling very [[emotion]].</p>
<p>I like brackets: [ is cool. ] is cool. [] are cool. But [[tag]] is special.</p>
</body>
</html>
[[footer]]
And here's the output from the parser:
TAG(header)
TEXT(
<html>
<head><title>title</title></head>
<body>
<h1>)
TAG(headertext)
TEXT(</h1>
<p>I am feeling very )
TAG(emotion)
TEXT(.</p>
<p>I like brackets: )
TEXT([ )
TEXT(is cool. ] is cool. )
TEXT([])
TEXT( are cool. But )
TAG(tag)
TEXT( is special.</p>
</body>
</html>
)
TAG(footer)
TEXT(
)
The final text node contains only the newline at the end of the file.

Resources