DOMDocument - How to get all inner text except from style/script tags? - parsing

I have spent a lot of time on what should be a very simple thing, so I am posting it here on StackOverflow.
I want to get all of the inner text except the contents of script/style tags.
$doc = new DOMDocument;
$doc->preserveWhiteSpace = false;
$html = <<<EOD
<div>
<script>var main=0</script>
<div>
<p>my</p>
<script>var inner=0</script>
</div>
<p>text</p>
only
</div>
EOD;
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
echo $entries = $xpath->query('//*[not(self::script)]')->item(0)->nodeValue;
gives me
var main=0 my var inner=0 text only
and also tried
$entries = $xpath->query('//*[not(self::script)]');
foreach ($entries as $entry) {
if ($entry->tagName == 'style' || $entry->tagName == 'script') {
continue;
}
echo preg_replace('/\s\s+/', ' ', $entry->nodeValue);
}
gives me
var main=0 my var inner=0 text only var main=0 my var inner=0 text only var main=0 my var inner=0 text only my var inner=0mytext
I tried several XPath expressions, but none of them work.
My desired output is: my text only
I am a Scrapy developer and I do this easily in Scrapy, but I am having a bad time with PHP today.

Unfortunately, PHP doesn't support XPath 2.0 (and, IIRC, neither does Scrapy), so the 2.0 features which would have made this easy aren't available...
The closest thing I can think of is the following, which should get you close enough (note that, because there is no <style> tag in your $html, I only focused on <script>):
$entries = $xpath->query('//*[not(./text()/parent::script)]/text()');
foreach ($entries as $entry) {
echo trim($entry->textContent) . " ";
}
Output:
my text only

Related

Extract visual text from Google Classic Site page using Apps Script in Google Sheets

I have about 5,000 Classic Google Sites pages that I need to have a Google Apps script under Google Sheets examine one by one, extract the data, and enter that data into the Google Sheet row by row.
I wrote an Apps Script that runs down one of the sheets, called "Pages", which contains the exact URL of each page row by row, and does the extraction for each one.
That in return would get the HTML contents and I would then use regex to extract the data I want which is the values to the right of each of the following...
Job name
Domain owner
Urgency/Impact
ISOC instructions
The script would then write that data under the proper columns in the Google Sheet.
This worked except for one big problem: the HTML is not consistent. Also, IDs and tags were not used, so doing this through SitesApp.getPageByUrl is not really possible.
Here is the code I came up with for that attempt.
function startCollection () {
var masterList = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Pages");
var startRow = 1;
var lastRow = masterList.getLastRow();
for(var i = startRow; i <= lastRow; i++) {
var target = masterList.getRange("A"+i).getValue();
sniff(target)
};
}
function sniff (target) {
var pageURL = target;
var pageContent = SitesApp.getPageByUrl(pageURL).getHtmlContent();
Logger.log("Scraping: ", target);
// Extract the job name
var JobNameRegExp = new RegExp(/(Job name:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(\<\/td>)/m);
var JobNameValue = JobNameRegExp.exec(pageContent);
// exec() returns null when nothing matches, so check before reading the capture group
var JobMatch = JobNameValue ? JobNameValue[2] : "NOT FOUND: " + pageURL;
// Extract domain owner
var DomainRegExp = new RegExp(/(Domain owner:<\/b><\/td><td style='text-align:left;width:738px'><span style='font-family:arial,sans,sans-serif;font-size:13px'>)(.*?)(<\/span>)/m);
var DomainValue = DomainRegExp.exec(pageContent);
Logger.log("DUMP1:",SitesApp.getPageByUrl(pageURL).getHtmlContent());
// exec() returns null when nothing matches, so check before reading the capture group
var DomainMatch = DomainValue ? DomainValue[2] : "N/A";
// Extract Urgency & Impact
var UrgRegExp = new RegExp(/(Urgency\/Impact:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(<\/td>)/m);
var UrgValue = UrgRegExp.exec(pageContent);
// exec() returns null when nothing matches, so check before reading the capture group
var UrgMatch = UrgValue ? UrgValue[2] : "N/A";
// Extract ISOC Instructions
var ISOCRegExp = new RegExp(/(ISOC instructions:<\/b><\/td><td style='text-align:left;width:738px'>)(.*?)(<\/td>)/m);
var ISOCValue = ISOCRegExp.exec(pageContent);
// exec() returns null when nothing matches, so check before reading the capture group
var ISOCMatch = ISOCValue ? ISOCValue[2] : "N/A";
// Add record to sheet
var row_data = {
Job_Name:JobMatch,
Domain_Owner:DomainMatch,
Urgency_Impact:UrgMatch,
ISOC_Instructions:ISOCMatch,
};
insertRowInTracker(row_data)
}
function insertRowInTracker(rowData) {
var sheet = SpreadsheetApp.getActiveSpreadsheet().getSheetByName("Jobs");
var rowValues = [];
var columnHeaders = sheet.getDataRange().offset(0, 0, 1).getValues()[0];
Logger.log("Writing to the sheet: ", sheet.getName());
Logger.log("Writing Row Data: ", rowData);
columnHeaders.forEach((header) => {
rowValues.push(rowData[header]);
});
sheet.appendRow(rowValues);
}
So for my next idea, I have thought about using UrlFetchApp.fetch. The one problem I have, though, is that the pages on that Classic Google Site sit on a domain that is not shared with the public. While SitesApp.getPageByUrl has the script ask for authorization and works, UrlFetchApp.fetch does not, meaning that when it tries to call the page directly, it just gets the Google login page.
I might be able to work around this and turn them public, but I am still working on that.
I am running out of ideas fast on this one and hoping there is another way I have not thought of or seen. What I would really like to do is not even mess with the HTML content. I would like to use apps script under the Google Sheet to just look at the actual data presented on the page and then match a text and capture the value to the right of it.
For example have it go down the list of URLS on sheet called "Pages" and do the following for each page:
Find the following values:
Find the text "Job name:", capture the text to the right of it.
Find the text "Domain owner:", capture the text to the right of it.
Find the text "Urgency/Impact:", capture the text to the right of it.
Find the text "ISOC instructions:", capture the text to the right of it.
Write those values to a new row in sheet called "Jobs" as seen below.
Then move on to the next URL in the sheet called "Pages" and repeat until all rows in the sheet "Pages" have been completed.
Example of the data I want to capture
I have created an exact copy of one of the pages for testing, and it is public.
https://sites.google.com/site/2020dump/test
An inspect example
The raw HTML of the table which contains all the data I am after.
<tr>
<td style="width:190px"><b>Domain owner:</b></td>
<td style="text-align:left;width:738px">IT.FinanceHRCore </td>
</tr>
<tr>
<td style="width:190px"> <b>Urgency/Impact:</b></td>
<td style="text-align:left;width:738px">Medium (3 - Urgency, 3 - Impact) </td>
</tr>
<tr>
<td style="width:190px"><b>ISOC instructions:</b></td>
<td style="text-align:left;width:738px">None </td>
</tr>
<tr>
<td style="width:190px"></td>
<td style="text-align:left;width:738px"> </td>
</tr>
</tbody>
</table>
Any examples of how I can accomplish this? I am not sure how from an apps script perspective to go about not looking at HTML and only looking at the actual data displayed on the page. For example looking for the text "Job name:" and then grabbing the text to the right of it.
The goal at the end of the day is to transfer the data from each page into one big Google Sheet so we can kill off the Google Classic Site.
I have been scraping data with apps script using regular expressions for a while, but I will say that the formatting of this page does make it difficult.
A lot of the pages that I scrape have tables in them so I made a helper script that will go through and clean them up and turn them into arrays. Copy and paste the script below into a new google script:
function scrapetables(html,startingtable,extractlinksTF) {
var tableregex = /<table[\s\S]*?<\/table>/g;
// match() returns null when there is no table, so fall back to an empty list
var tables = html.match(tableregex) || [];
var arrays = []
var i = startingtable || 0;
while (tables[i]) {
var thistable = []
var rows = tables[i].match(/<tr[\s\S]*?<\/tr>/g);
if(rows) {
var j = 0;
while (rows[j]) {
// pass extractlinksTF through so link extraction actually takes effect
thistable.push(tablerow(rows[j], extractlinksTF))
j++
}
arrays.push(thistable);
}
i++
}
return arrays;
}
function removespaces(string) {
var newstring = string.trim().replace(/[\r\n\t]/g,'').replace(/&nbsp;/g,' ');
return newstring
}
function tablerow(row,extractlinksTF) {
// match() returns null for a row without cells, so fall back to an empty list
var cells = row.match(/<t[dh][\s\S]*?<\/t[dh]>/g) || [];
var i = 0;
var thisrow = [];
while (cells[i]) {
thisrow.push(removehtmlmarkup(cells[i],extractlinksTF))
i++
}
return thisrow
}
function removehtmlmarkup(string,extractlinksTF) {
var string2 = removespaces(string.replace(/<\/?[A-Za-z].*?>/g,''))
var obj = {string: string2}
//check for link
if(/<a href=.*?<\/a>/.test(string)) {
obj['link'] = /<a href="(.*?)"/.exec(string)[1]
}
if(extractlinksTF) {
return obj;
} else {return string2}
}
Running this got close, but at the moment, this doesn't handle nested tables well so I cleaned up the input by sending only the table that we want by isolating it with a regular expression:
var tablehtml = /(<table[\s\S]{200,1000}Job Name[\s\S]*?<\/table>)/im.exec(html)[1]
Your parent function will then look like this:
function sniff(pageURL) {
var html= SitesApp.getPageByUrl(pageURL).getHtmlContent();
var tablehtml = /(<table[\s\S]{200,1000}Job Name[\s\S]*?<\/table>)/im.exec(html)[1]
var table = scrapetables(tablehtml);
var row_data =
{
Job_Name: na(table[0][3][1]), //indicates the 1st table in the html, row 4, cell 2
Domain_Owner: na(table[0][4][1]), // indicates 1st table in the html, row 5, cell 2 etc...
Urgency_Impact: na(table[0][5][1]),
ISOC_Instructions: na(table[0][6][1])
}
insertRowInTracker(row_data)
}
function na(string) {
if(string) {
return string
} else { return 'N/A'}
}
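The question also asks for a way to ignore the markup and simply capture the text to the right of a label. A rough sketch of that idea, under the assumption that each label and its value sit in adjacent cells of the same table row (as in the sample HTML above), is to reduce the HTML to plain text first and then search by label. The helper names htmlToText and valueAfterLabel are made up for this example and are not part of Apps Script:

```javascript
// Hypothetical helpers, not part of the original answer or of Apps Script.

// Reduce an HTML fragment to plain text, keeping cell and row boundaries
// as tab and newline markers so a label and its value stay on one line.
function htmlToText(html) {
  return html
    .replace(/\s+/g, " ")                             // collapse source line breaks and indentation
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, " ")  // drop script/style blocks entirely
    .replace(/<\/t[dh]>/gi, "\t")                     // end of cell -> tab
    .replace(/<\/tr>/gi, "\n")                        // end of row -> newline
    .replace(/<[^>]+>/g, "")                          // strip every remaining tag
    .replace(/&nbsp;/g, " ");                         // non-breaking spaces -> plain spaces
}

// Return the text that follows `label` in the same row, or "N/A" when the
// label is missing or its value cell is empty.
function valueAfterLabel(text, label) {
  const escaped = label.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const match = new RegExp(escaped + "[ \\t]*([^\\t\\n]*)", "i").exec(text);
  const value = match ? match[1].trim() : "";
  return value || "N/A";
}
```

Inside sniff, the text would come from htmlToText(SitesApp.getPageByUrl(pageURL).getHtmlContent()), and each field from a call such as valueAfterLabel(text, "Domain owner:"). This does not depend on row positions, so it may cope better with pages whose tables have a different number of rows, but it is still a regex-based approximation rather than a real HTML parser.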

Swift - Split text based on arabic combined characters

Hello,
I have an Arabic sentence like this one:
أكل الولد التفاحة
How can I split the sentence based on UNCONNECTED characters, so that it looks like this:
أ-
كل
ا-
لو-
لد
ا-
لتفا-
حة
I added the "-" only to explain what I mean.
I just need to split the text into an array based on that.
How can I do that in Swift for iOS?
Update:
I don't care about the spaces.
"أكل", for example, is one word and doesn't contain spaces. I want to split based on UNCONNECTED characters.
So "أكل" consists of two objects: "أ" and "كل"
الولد: three objects, "ا", "لو" and "لد"
Use the code below:
let a = "أكل الولد التفاحة".split(separator: " ")
You can replace the spaces with "-" using replacingOccurrences(of:with:), which requires Foundation:
let text = "أكل الولد التفاحة".replacingOccurrences(of: " ", with: "-")
I don't know how the accepted answer helps to fix the issue.
Apple already provides the Natural Language framework to handle this kind of thing, and it is more trustworthy.
When you work with natural language text, it’s often useful to tokenize the text into individual words. Using NLTokenizer to enumerate words, rather than simply splitting components by whitespace, ensures correct behavior in multiple scripts and languages. For example, neither Chinese nor Japanese uses spaces to delimit words.
Here is an example:
import NaturalLanguage

let text = """
All human beings are born free and equal in dignity and rights.
They are endowed with reason and conscience and should act towards one another in a spirit of brotherhood.
"""
let tokenizer = NLTokenizer(unit: .word)
tokenizer.string = text
tokenizer.enumerateTokens(in: text.startIndex..<text.endIndex) { tokenRange, _ in
print(text[tokenRange])
return true
}
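Here is the link to the Apple docs.
I hope it is helpful.
There are two boxes. Click in the first one and the clipboard content is pasted into it automatically, then click Convert. The output, with the characters separated by spaces, is copied to the clipboard automatically. I used this for the Quran.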
<h1>Allah</h1>
<center>
<textarea id="field" onclick="paste(this)" style="font-size: xxx-large;min-width: 90%; min-height: 200px;"> </textarea>
<center>
</center>
<br>
<textarea id="field2" style="font-size: xxx-large;min-width: 95%; min-height: 200px;"> </textarea>
</center>
<center>
<br>
<button onclick="myFunction()" style="font-size: xx-large;min-width: 20%;">Convert</button>
</center>
<script >
function myFunction(){
var string = document.getElementById("field").value;
// Array.from splits by code point, so every character becomes one array item
var chars = Array.from(string);
// Join with spaces; unlike toString() followed by a comma replace,
// this keeps commas that are part of the text and does not glue the first two characters together
var stringWithSpaces = chars.join(' ');
console.log(stringWithSpaces);
document.getElementById("field2").value = stringWithSpaces;
var copyTextarea = document.querySelector('#field2');
copyTextarea.focus();
copyTextarea.select();
try {
var successful = document.execCommand('copy');
var msg = successful ? 'successful' : 'unsuccessful';
console.log('Copying text command was ' + msg);
} catch (err) {
console.log('Oops, unable to copy');
}
};
/*
var copyTextareaBtn = document.querySelector('#newr');
copyTextareaBtn.addEventListener('click', function(event) {
var copyTextarea = document.querySelector('#field2');
copyTextarea.focus();
copyTextarea.select();
try {
var successful = document.execCommand('copy');
var msg = successful ? 'successful' : 'unsuccessful';
console.log('Copying text command was ' + msg);
} catch (err) {
console.log('Oops, unable to copy');
}
});
*/
async function paste(input) {
document.getElementById("field2").value = "";
const text = await navigator.clipboard.readText();
input.value = text;
}
</script>
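The converter above separates every character, which is not quite the split the question asks for. A minimal JavaScript sketch of splitting after letters that do not connect to the following letter might look like the code below. The function name splitUnconnected and the list of non-joining letters are assumptions made for this example; the list covers common Arabic text and would need extending for other languages written in Arabic script.

```javascript
// Letters that never join to the following letter. This list is an assumption:
// alef forms, hamza, dal, thal, reh, zain, waw forms and teh marbuta.
const NON_JOINING = new Set([
  "\u0627", "\u0623", "\u0625", "\u0622", "\u0621",
  "\u062F", "\u0630", "\u0631", "\u0632",
  "\u0648", "\u0624", "\u0629",
]);

// Hypothetical helper: split text into runs of visually connected letters.
function splitUnconnected(text) {
  const segments = [];
  let current = "";
  for (const ch of text) {
    if (/\s/.test(ch)) {
      // Whitespace always ends the open segment and is dropped.
      if (current) segments.push(current);
      current = "";
    } else if (/\p{M}/u.test(ch) && !current && segments.length) {
      // A diacritic arriving when no segment is open is attached to the
      // previous segment, so it stays with the letter it decorates.
      segments[segments.length - 1] += ch;
    } else {
      current += ch;
      if (NON_JOINING.has(ch)) {
        // This letter does not connect forward, so the segment ends here.
        segments.push(current);
        current = "";
      }
    }
  }
  if (current) segments.push(current);
  return segments;
}
```

Calling splitUnconnected with the sentence from the question should give the eight pieces listed there. The same loop translates directly to Swift by iterating over the string's unicodeScalars and keeping the set of non-joining letters as a Set of scalars.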
Try this:
"أكل الولد التفاحة".map {String($0)}

How to trim <text> from MVC cshtml

Using MVC, I have the following if statement in the javascript portion of my CSHTML:
var url = '@if (@HttpContext.Current.Session["id"].ToString() == "1")
{
<text>Testing one two three</text>
}
else
{
@Url.Action("GetCustomer", "Customer")
}';
If I go into the ELSE portion, everything is fine and the following is produced:
var url = '/Customer/GetCustomer';
However, if I go into the IF portion, I am getting too much white space:
var url = '
Testing one two three ';
My question is, how can I trim that extra white space out and show as follows:
var url = 'Testing one two three';
Thank you in advance.
Thanks to Ciubotariu Florin here is the answer:
var url = '@(HttpContext.Current.Session["id"].ToString() == "1" ? "Testing one two three" : @Url.Action("GetCustomer", "Customer") )';
Remove the <text></text> tags:
var url = '@(HttpContext.Current.Session["id"].ToString() == "1" ? "Testing one two three" : @Url.Action("GetCustomer", "Customer") )';

using katex, '&' alignment symbol displays as 'amp;'

I am using katex to render math.
https://github.com/Khan/KaTeX
Generally, to get this to work I link to the files katex.min.js and katex.min.css from a cdn, which is one of the ways the directions suggest.
I wrap what needs to be rendered in tags and give all the same class. For example:
<span class='math'>\begin{bmatrix}a & b \\c & d\end{bmatrix}</span>
And inside a script tag I apply the following:
var math = document.getElementsByClassName('math');
for (var i = 0; i < math.length; i++) {
katex.render(math[i].innerHTML, math[i]);
}
So, my implementation works but there is a problem in what katex returns. The output of the above gives me:
This exact same question is asked here:
https://github.com/j13z/reveal.js-math-katex-plugin/issues/2
But I can't understand any of it.
The solution is to use element.textContent, not element.innerHTML.
If I use a form like what follows, the matrix will be rendered properly.
var math = document.getElementsByClassName('math');
for (var i = 0; i < math.length; i++) {
katex.render(math[i].textContent, math[i]); // <--element.textContent
}
A solution that works for me is the following (it is more of a hack rather than a fix):
<script type="text/javascript">
//first we define a function
function replaceAmp(str,replaceWhat,replaceTo){
replaceWhat = replaceWhat.replace(/[-\/\\^$*+?.()|[\]{}]/g, '\\$&');
var re = new RegExp(replaceWhat, 'g');
return str.replace(re,replaceTo);
}
//next we use this function to replace all occurrences of 'amp;' with ""
var katexText = $(this).html();
var html = katex.renderToString(String.raw``+katexText+``, {
throwOnError: false
});
//hack to fix amp; error
var amp = '<span class="mord mathdefault">a</span><span class="mord mathdefault">m</span><span class="mord mathdefault">p</span><span class="mpunct">;</span>';
var html = replaceAmp(html, amp, "");
</script>
function convert(input) {
var input = input.replace(/amp;/g, '&'); //Find all 'amp;' and replace with '&'
input=input.replace(/&&/g, '&'); //Find all '&&' and replace with '&'. For leveling 10&x+ &3&y+&125&z = 34232
var html = katex.renderToString(input, {
throwOnError: false});
return html
}
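If only the serialized HTML string is available, so that reading textContent is not an option, a slightly more general variant of the convert idea above is to decode the basic entities in a single pass before handing the string to KaTeX. This is only a sketch: decodeBasicEntities and ENTITY_MAP are names invented for the example, and only five common entities are handled.

```javascript
// Hypothetical helper: the five entities that innerHTML-style serialization
// most commonly produces in math source.
const ENTITY_MAP = { amp: "&", lt: "<", gt: ">", quot: '"', "#39": "'" };

// Decode each entity exactly once. Doing it in one pass means a literal
// "&amp;lt;" becomes "&lt;" and is not decoded a second time into "<".
function decodeBasicEntities(str) {
  return str.replace(/&(amp|lt|gt|quot|#39);/g, (whole, name) => ENTITY_MAP[name]);
}
```

The decoded string can then be passed on, for example katex.render(decodeBasicEntities(math[i].innerHTML), math[i]). Where the element itself is available, textContent remains the simpler choice.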
Which version are you using?
Edit src/utils.js and comment out lines 51 to 55, then run the npm run build command in a terminal.

Get code between tags and generate youtube embed

I have some text, and in the middle of the article I put {youtube}IPtv14q9ZDg{/youtube}. How can I turn the code between the {youtube} tags into a YouTube embed?
I use PHP, but there might be other options.
I filter the text for the keywords, then take the 11-character code and wrap it in a link tag. This works best in a for loop.
This is the one I use to find URLs in my text and make them live. You can modify it to do what you want by changing the preg_match pattern.
function make_clickable($string) {
$string = preg_replace("/[\n\r]/"," <br /> ",$string);
$arr = explode(' ', $string);
foreach($arr as $key => $value){
if(preg_match('#((^https?|http|ftp)://(\S*?\.\S*?))([\s)\[\]{},;"\':<]|\.\s|$)#i', $value)){
$arr[$key] = "<a class=\"custome link\" href='". $value ."' target=\"_blank\">$value</a> ";
}
}
$string = implode(' ', $arr);
return $string;
}
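Since the question allows options other than PHP, here is a minimal JavaScript sketch of the replacement itself. It assumes the video ID is always 11 characters made of letters, digits, "_" or "-", as in the example from the question; the function name embedYoutube and the iframe dimensions are choices made for this example.

```javascript
// Hypothetical helper: replace every {youtube}ID{/youtube} marker with an
// iframe that points at YouTube's embed URL for that ID.
function embedYoutube(text) {
  const marker = /\{youtube\}([A-Za-z0-9_-]{11})\{\/youtube\}/g;
  return text.replace(marker, (whole, id) =>
    '<iframe width="560" height="315" ' +
    'src="https://www.youtube.com/embed/' + id + '" ' +
    'frameborder="0" allowfullscreen></iframe>'
  );
}
```

Markers whose content does not look like a video ID are left untouched, which keeps arbitrary text from ending up inside the src attribute. The same pattern should carry over to PHP's preg_replace with the ID captured in a group.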
