I am using the following Google Sheets formula to translate some fields containing HTML tags:
=GOOGLETRANSLATE(A2, "en", "de")
However, the translation produces broken HTML with extra spaces inside opening and closing tags, especially when there are many nested tags.
For example:
<div> <p>paragraph text</p> </div>
will result in:
<div> <p> Absatztext </ P> </ Div>
Sometimes the translator alters opening and closing tags, inserts extra spaces around attributes, and uppercases the letters of closing tags.
Issues like:
<p> Absatztext <P />
<a href = " # "> Link </ A>
Sometimes text is moved inside the opening tag:
<h2 Was> ist Pilates? </h2>
it should be:
<h2> Was ist Pilates? </h2>
Demo here:
https://docs.google.com/spreadsheets/d/11MOZjTknFGdwuAp6g3VUa0o5OQaW44hxN2uEvqnL3jw/edit?usp=sharing
How can I fix those problems?
Try a simple fix like:
=LOWER(SUBSTITUTE(GOOGLETRANSLATE(A1, "en", "de"), "/ ", "/"))
UPDATE:
=SUBSTITUTE(A1, TRIM(REGEXREPLACE(A1, "</?\S+[^<>]*>", )),
GOOGLETRANSLATE(TRIM(REGEXREPLACE(A1, "</?\S+[^<>]*>", )), "EN", "DE"))
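One caveat with the first formula is that LOWER lowercases the translated text as well as the tags, which hurts German output because nouns are capitalized. As a rough illustration of a narrower cleanup, here is a minimal Python sketch that only repairs closing tags. The function name fix_closing_tags and the regular expression are my own assumptions, not part of the original answer, and the sketch does not attempt to fix attribute spacing or text moved into an opening tag.

```python
import re

# Matches a closing tag that may contain stray spaces and uppercase letters,
# e.g. "</ P>" or "</ Div >".
CLOSING_TAG = re.compile(r"</\s*([A-Za-z][A-Za-z0-9]*)\s*>")


def fix_closing_tags(html: str) -> str:
    """Remove stray spaces in closing tags and lowercase the tag name only."""
    return CLOSING_TAG.sub(lambda m: f"</{m.group(1).lower()}>", html)


print(fix_closing_tags("<div> <p> Absatztext </ P> </ Div>"))
# prints: <div> <p> Absatztext </p> </div>
```

The translated text itself is left untouched, so "Absatztext" keeps its capital letter.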
If you don't mind not doing it in a single formula and just want to solve the problem, you could try splitting the text and translating only the parts that are not HTML tags.
Put all HTML tags and closing tags in a separate sheet, so you can check for them. I'll put mine in 'tags'!A1:B128.
Considering you have your original text in A1, you can split it up by < and >:
=SPLIT(A1,"<>")
then on a line below (or elsewhere, for me it'll be A2) you can check if the first word in each cell is found among the tags with:
NOT(COUNTIF(tags!$A$1:$B$128,INDEX(SPLIT(A2," "),1,1)))
translate it if it's true with
GOOGLETRANSLATE(A2, "en", "de")
or add the brackets back with
"<"&A2&">" so the whole formula will look like
=IF(NOT(COUNTIF(tags!$A$1:$B$128,INDEX(SPLIT(A2," "),1,1))),GOOGLETRANSLATE(A2, "en", "de"), "<"&A2&">")
then on a line below, just join the whole row back to a single cell with
=JOIN("",A3:L3)
You can hide rows 2 and 3 for convenience, or even put them on a separate sheet along with the tags. You can also add a condition not to add < and > if it's empty, so you can join up the whole row without looking at how long it is.
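The split, translate, and join steps above can also be sketched outside of Sheets. The Python below is only an illustration under assumptions: it recognizes tags with a regular expression instead of a lookup sheet, and fake_translate is a hypothetical stand-in for whatever translation call you actually use.

```python
import re
from typing import Callable

# Splits on anything that looks like a tag; the capturing group keeps the
# tags themselves in the result list.
TAG = re.compile(r"(<[^<>]+>)")


def translate_html(html: str, translate: Callable[[str], str]) -> str:
    """Translate the text between tags and leave the tags untouched."""
    parts = []
    for piece in TAG.split(html):
        if TAG.fullmatch(piece) or not piece.strip():
            parts.append(piece)  # a tag or pure whitespace: keep as-is
        else:
            parts.append(translate(piece))
    return "".join(parts)


# Hypothetical stand-in for a real translation call.
def fake_translate(text: str) -> str:
    return {"paragraph text": "Absatztext"}.get(text, text)


print(translate_html("<div> <p>paragraph text</p> </div>", fake_translate))
# prints: <div> <p>Absatztext</p> </div>
```

Because the tags never reach the translator, they cannot be respaced or uppercased. The trade-off is that each text fragment is translated without its surrounding context.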
If you'd like to do this in a single formula, you'd have to write a script for it, as some formulas act strangely with arrayformula and sometimes are barely usable.
I think the most convenient solution nowadays is to use a custom JS function (Extensions > Apps Script):
var spanish = LanguageApp.translate('This is a <strong>test</strong>',
'en', 'es', {contentType: 'html'});
// The code will generate "Esta es una <strong>prueba</strong>".
LanguageApp.translate (API docs) accepts an optional fourth argument, an object of advanced options; its contentType can be text or html.
For huge tables be aware that there are daily limits (quotas)!
I'm trying to import just the text from a div on a client's site into a Google Sheet using the =IMPORTXML function so they can see everything on one sheet. The problem is that some pages have <a href> tags wrapped around text, which, if I use =IMPORTXML(website, "[xpath]/text"), gives me an error about the array overwriting the next cell. So I tried some of the tricks around the web (wrapping in =REGEXREPLACE, =JOIN, etc.), and those got me the text of the div minus the text of the children.
For example, if I have this HTML
<div class="text">
I want to get this text and
this text, too
so what do I do?
</div>
In my sheet I get "I want to get this text and so what do I do?"
I found the solution for anyone else trying to do this:
=JOIN(CHAR(10),IMPORTXML("site","//div[@class='divclass']//text()"))
My mistake was using /text() instead of //text(). The extra / was missing.
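To illustrate the difference between /text() and //text(), here is a small Python sketch using the standard library's ElementTree. It mimics the two XPath steps rather than running them, and the markup is an assumption: the link around the middle sentence is a guess at what the page contained.

```python
import xml.etree.ElementTree as ET

# Assumed markup: the link around the middle sentence is a guess at what the
# original page contained.
snippet = (
    '<div class="text">I want to get this text and '
    '<a href="#">this text, too</a> so what do I do?</div>'
)
div = ET.fromstring(snippet)

# Roughly what /text() returns: text nodes that are direct children of the div.
direct = [div.text or ""] + [child.tail or "" for child in div]

# Roughly what //text() returns: every text node at any depth, in document order.
everything = list(div.itertext())

print("".join(direct))
print("".join(everything))
```

The first line printed is missing the link text, which matches the result described in the question. The second includes it.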
I think I might just be over complicating things instead of keeping it simple.
My question is: I want to capture the title of a blog post into a prop variable, and the author who wrote it into another prop variable.
My thought would be to create a page load rule focusing only on the path of /blog. From there I would scrape the page looking for the class that defines it, and then pass it into my prop through DTM.
<div class="field field-name-title">
<h2>Online Education</h2>
<div class="field field-name-body">
<p>
<em> by Author Name</em>
</p>
</div>
</div>
I created a page load rule, picked my prop, and set it to: div.field.field-name-title.innerText. But when the rule fires, the only value passed is the literal string "div.field.field-name-title.innerText".
Am I tackling this in the wrong way?
The values you enter in a text field are literal, with the exception of %data_element% syntax, which signifies a reference to a Data Element (there are a couple of other built-in variable references, as well).
Point is, if you want to populate your Adobe Analytics variable from scraping page content, you need to create a Data Element that returns the desired value, and then reference the Data Element in the text field for the Adobe Analytics variable.
That aside, your selector is wrong. What you've written is a mix of CSS selector and JavaScript syntax.
Below is an example of what you can do, based on your posted HTML:
<div class="field field-name-title">
<h2>Online Education</h2>
<div class="field field-name-body">
<p>
<em> by Author Name</em>
</p>
</div>
</div>
Data Element: Article Title
First, create a Data Element to get the article title from the page, based on your html structure.
Go to Rules > Data Elements > Create New Data Element
Fill out the fields with the following:
Name: article_title
Type: CSS Selector
CSS Selector Chain: div.field-name-title h2
get the value of: text
[X] Scrub whitespace and linebreaks using cleanText
Then, click Save Changes
Data Element: Article Author
Next, create another Data Element to get the article author from the page, based on your html structure.
Go to Rules > Data Elements > Create New Data Element
Fill out the fields with the following:
Name: article_author
Type: CSS Selector
CSS Selector Chain: div.field-name-body em
get the value of: text
[X] Scrub whitespace and linebreaks using cleanText
Then, click Save Changes
Page Load Rule: Populate Variables
Finally, within the various form fields of your Page Load Rule, you can now reference your Data Elements with %data_element_name% syntax.
Tip: Once you start typing the Data Element name out (starting with % prefix), DTM will show an auto-complete dialog, listing Data Elements matched.
If you need to reference the Data Element within a javascript custom code box within the Page Load Rule, you can use the following syntax:
_satellite.getVar('data_element_name');
Where 'data_element_name' is the name of your Data Element.
Example:
s.prop1 = _satellite.getVar('article_title');
Note: Unlike the form field syntax, you should not wrap your Data Element's name with %.
For example this HTML
<div>
<span></span> I want to find this <b>this works ok</b>.
</div>
I want to find a DIV containing the text "I want to find this" and then grab the whole text inside that DIV, including child elements.
My XPATH, //*[contains(text(), 'I want to find this')] does not work at all.
If I use //*[contains(text(), 'this works')] it works, but I want to find any DIV based on the "I want to find this" text.
However, if I remove the <span></span> from that HTML, it works, why is that?
text() selects the div's direct text nodes, but contains() only looks at the first one, which here is the whitespace before <span>. You can replace it with . to search the string value of the current node.
//div[contains(., 'I want to find this')]
This will search in a string concatenation of all text nodes inside the current node.
To grab all text you can use node.itertext() to iterate all inner texts if you are using lxml:
from lxml import etree

html = """
<div>
<span></span> I want to find this <b>this works ok</b>.
</div>
"""
root = etree.fromstring(html, etree.HTMLParser())
for div in root.xpath('//div[contains(., "I want to find this")]'):
    # itertext() yields every text node under the div, in document order
    print(''.join(div.itertext()).strip())
# => I want to find this this works ok.
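To see why contains(text(), ...) misses the div, it helps to list the div's text nodes. The sketch below uses the standard library's ElementTree instead of lxml and imitates the XPath 1.0 rule that a node-set passed where a string is expected is converted using its first node only. It is an approximation of that behaviour, not an XPath engine.

```python
import xml.etree.ElementTree as ET

div = ET.fromstring(
    "<div>\n<span></span> I want to find this <b>this works ok</b>.\n</div>"
)

# The div's own text nodes, in document order (what text() selects).
text_nodes = [div.text or ""] + [child.tail or "" for child in div]
print(text_nodes)

needle = "I want to find this"
# XPath 1.0 converts a node-set to a string by taking its first node only.
print(needle in text_nodes[0])            # False: the first node is just "\n"
# "." uses the string value of the div: all descendant text joined together.
print(needle in "".join(div.itertext()))  # True
```

This also explains the observation in the question: without the empty <span>, the first text node runs from the opening <div> up to <b>, so it contains the search text.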
Try using //*[text()=' I want to find this ']. This will select the div tag, and you can then use the getText() method to get its text.
You can also try replacing text() with string():
//div[contains(string(), " I want to find this")]
Or, you can check that span's following text sibling contains the text:
//div[contains(span/following-sibling::text(), " I want to find this")]
I have a comments section in my application where users enter input in a text area. I want to preserve the line breaks they enter but also display HTML as a string. For example, if comment.body is
Hello, this is the code: <a href='foo'>foo</a>
Bye
I want it to be displayed just as above. The same with anything else, including iframe tags.
The closest I got is:
= simple_format(comment.body)
but it sanitizes html code and it's not displayed. Example: foo <iframe>biz</iframe> bar is displayed as:
foo biz bar
What should I do to achieve what I want?
Just output it without any helper. Rails escapes it by default, so it will be rendered as plain text:
= comment.body
Using your second example, the output will be:
foo <iframe>biz</iframe> bar
To make \n behave as <br>, you can use CSS:
.add-line {
white-space: pre-wrap;
}
And use it in your view:
.add-line = comment.body
Using your first example:
comment.body = "Hello, this is the code: <a href='foo'>foo</a>\n\nBye"
The output will be:
Hello, this is the code: <a href='foo'>foo</a>
Bye
Having done something similar in the past, I think you must first understand why HTML is sanitized from user input.
Imagine I wrote the following into a field that accepted HTML and displays this to the front page.
<script>alert('Hello')</script>
The code would execute for anyone visiting the front-page and annoyingly trigger a JS alert for every visitor.
Maybe not much of an issue yet, but imagine I wrote some AJAX request that sent user session IDs to my own server. Now this is an issue... because people's sessions are being hijacked.
Furthermore, there is a full JavaScript based exploitation framework called BeEF that relies on this type of website exploit called Cross-site Scripting (XSS).
BeEF does extremely scary stuff and is worth taking a look at when considering user generated HTML.
http://guides.rubyonrails.org/security.html#cross-site-scripting-xss
So what to do? Well if you checked in your DB you'd see that the tags are actually being stored, but like you pointed out aren't displayed.
You could .html_safe the content, but again I strongly advise against this.
Maybe instead you should write an alternative .html_safe method yourself, something like html_safe_whitelisted_tags.
As for newlines, you say you want them displayed as is, so replacing \n with <br>, as pointed out by Michael, would be the solution for you. Note the double quotes: in Ruby, '\n' in single quotes is a literal backslash followed by n, not a newline.
comment.body.gsub("\n", "<br />").html_safe_whitelisted_tags
html_safe lets the HTML in the comment render as HTML but does nothing about newlines, so a quick replace of \n with <br /> covers the line breaks:
comment.body.gsub("\n", "<br />").html_safe
If you want the HTML to be displayed instead of rendered, check out CGI::escapeHTML(), and do the gsub afterwards so that the <br /> does not get escaped.
CGI::escapeHTML(comment.body).gsub("\n", "<br />")
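The order of operations matters here: escape first, then insert the <br /> tags. The same idea is sketched below in Python rather than Ruby, using the standard library's html.escape. The function name render_comment is my own, not from the answers above.

```python
import html


def render_comment(body: str) -> str:
    """Escape user input first, then turn newlines into <br /> tags."""
    escaped = html.escape(body)  # < > & and quotes become entities
    return escaped.replace("\n", "<br />")


print(render_comment("foo <iframe>biz</iframe> bar\nBye"))
# prints: foo &lt;iframe&gt;biz&lt;/iframe&gt; bar<br />Bye
```

Doing the replacement before escaping would turn the inserted <br /> into visible text, because its angle brackets would be escaped along with the rest of the comment.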
I want to nest a <span> element inside a form_for label tag. I want to do this so I can target a specific portion of the label with CSS rules, in this case to make the text red. From some quick reading, this does appear to be valid HTML, and it fits with my design even though the idea is not playing happily with Rails.
The desired html output is like this:
<label for="zip">ZIP Code -<span class="required">Required</span></label>
My current code looks like this:
<%= form.label :zip, 'ZIP Code -<span class="required">Required</span>' %>
The problem is that Rails is somehow escaping the inner span tag so that it appears as text on the page instead of HTML. I see this on the page:
ZIP Code -<span class="required">Required</span>
Rails 3 automatically escapes strings. You need to call #html_safe on the string you're putting in the label. See http://yehudakatz.com/2010/02/01/safebuffers-and-rails-3-0/ for details.