What is the most convenient way to remove all the HTML tags when using the SAS URL access method to read web pages?
This should do what you want. It removes everything between < and >, including the brackets themselves, and leaves just the text content (roughly the element's inner text).
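filename INDEXIN URL "http://www.zug.com/";
data HTMLData;
  /* Read the page through the URL fileref, one line per observation */
  infile INDEXIN lrecl=32767;
  input;
  textline = _INFILE_;
  /*-- Clear out the HTML tags; a tag split across two lines is not caught --*/
  re1 = prxparse("s/<(.|\n)*?>//");
  call prxchange(re1, -1, textline);
  drop re1;
run;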
I think the better approach is not to strip the HTML from the page, but to identify the standard patterns around the data you are trying to capture. This is the Perl / regular-expression style of working.
An example might be some data or a table that always comes so many characters after the logo image. You could write a script to keep only that data.
If you want to post some of the HTML, maybe we can help decode it.
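As a rough illustration of the pattern-based approach described above, here is a minimal Ruby sketch. The HTML snippet, the "Price" label and the variable names are all made up for the example; a real page would need its own pattern.

```ruby
# Hypothetical page fragment: a logo image followed by a small table.
html = <<~HTML
  <img src="logo.png" alt="logo">
  <table><tr><td>Price</td><td>42.50</td></tr></table>
HTML

# Capture only the cell that follows the "Price" label,
# instead of stripping every tag from the page.
price = html[/<td>Price<\/td>\s*<td>([^<]*)<\/td>/, 1]

puts price
```

This only works while the markup around the data stays stable, which is the trade-off of matching on patterns.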
Based on this article
https://resources.infosecinstitute.com/topic/how-to-prevent-cross-site-scripting-attacks/
Reflected XSS happens when injected data is reflected in the response. I get the idea that if I, for example, have a search box in my page and the search term entered by a user is displayed in the page, someone could write as a search term:
<script>alert('x');</script>
and that would be read as a regular HTML element in the page that displays the response.
But let's say greater than and less than are already blocked in the input (meaning they wouldn't be able to put in script tags or any other tag). What's the issue if I allow single quote, double quote, ampersand, and backslash to be reflected in the response? I'm trying to make sense of it, but I am not sure I understand correctly.
Today the web stack is big and complex, with many languages. We have HTML, CSS, JavaScript, VBScript, SVG, URLs…
Each with its own rules for:
Encoding
Quoting
Commenting
Escaping
Also, each one can be nested inside the others, for example a URL inside a JavaScript string inside an HTML attribute.
Just replacing <> fixes some issues, but not all of them, because you don't know where your data will end up. Is it in HTML? In an HTML attribute? Inside a JavaScript string? Each one needs a different encoding to become safe.
So, the world is a bit more complicated.....
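To make the point about contexts concrete, here is a minimal Ruby sketch (Ruby is used only for illustration). The payload, the `strip_angle_brackets` helper and the `/search` URL are hypothetical; `CGI.escapeHTML` and `URI.encode_www_form_component` are standard-library calls.

```ruby
require "cgi"
require "uri"

# Hypothetical attacker input that contains no < or > at all.
payload = %q(" onmouseover="alert(1))

# The naive filter from the question: only angle brackets are removed.
def strip_angle_brackets(s)
  s.delete("<>")
end

# Unsafe: the double quote closes the attribute and injects a new one.
unsafe_html = %(<input value="#{strip_angle_brackets(payload)}">)

# HTML attribute context: escape &, <, >, " and '.
safe_html = %(<input value="#{CGI.escapeHTML(payload)}">)

# URL query-string context: percent-encode instead.
safe_url = "/search?q=" + URI.encode_www_form_component(payload)

puts unsafe_html
puts safe_html
puts safe_url
```

The same input needs a different encoder in each place, which is why filtering < and > alone is not enough.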
I am wondering whether it is considered good practice to encode user input before storing it in the database.
Or is it OK to store the user input without encoding it?
Currently I encode it when it enters the database and use Html.DisplayFor to display it.
No. You want to keep the input in its original form until you need it and know what the output type is. It might be HTML for now, but if you later want to change it to JSON, a text file, XML, etc., the encoding might make it look different than you want.
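A small sketch of why encoding on input causes trouble, written in Ruby rather than ASP.NET purely for illustration. The sample title and variable names are made up; `CGI.escapeHTML` and `JSON.generate` are standard-library calls.

```ruby
require "cgi"
require "json"

# Hypothetical user input, stored exactly as entered.
raw = %q(Tom & Jerry's "<b>best</b>" episodes)

# Encode only at output time, once the target format is known.
as_html = CGI.escapeHTML(raw)
as_json = JSON.generate({ "title" => raw })

# If the value had been HTML-encoded before storage, every other
# output format would inherit the HTML entities.
stored_encoded = CGI.escapeHTML(raw)
wrong_json = JSON.generate({ "title" => stored_encoded })

puts as_html
puts as_json
puts wrong_json
```

A JSON consumer of `wrong_json` would see `&amp;` and `&lt;` in the title instead of the characters the user typed.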
So, first you want to make sure you are securely validating your input. It is a good idea to know the requirements for each of your inputs and to validate that they are within the correct length, range, character set, etc. It is in your interest to limit the characters that are allowed as valid for each input type. (If you use regular expressions to validate input, make sure you do not use one that is susceptible to a Regular Expression Denial of Service.)
When moving the data around in your code, ensure that you are handling it in a manner that will not turn it into an injection attack.
Since you are talking about a database, the best practice is to use parameterized statements. Check out the prevention methods in the above link.
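As a hedged example of this kind of validation, here is a minimal Ruby sketch. The username rule (3 to 20 lowercase letters, digits or underscores) and the names `USERNAME_PATTERN` and `valid_username?` are assumptions made for illustration. The pattern is a single character class with a bounded repeat and no nested quantifiers, so it should not backtrack catastrophically.

```ruby
# Hypothetical rule: 3-20 lowercase letters, digits or underscores.
# \A and \z anchor the whole string; in Ruby, ^ and $ only anchor a line.
USERNAME_PATTERN = /\A[a-z0-9_]{3,20}\z/

def valid_username?(input)
  input.is_a?(String) && USERNAME_PATTERN.match?(input)
end

puts valid_username?("alice_01")
puts valid_username?("<script>")
puts valid_username?("ok\n<script>")
```

Validation like this limits what gets in; it does not replace encoding at output time.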
Then, when it comes to output in MVC: if you are not using the Raw or MvcHtmlString functions/calls, the output is automatically encoded. With the automatic encoding, you want to make sure you are using the AntiXss encoder and not the default (whitelist approach vs. blacklist). Link
If you are using Raw or MvcHtmlString, you want to make sure you COMPLETELY TRUST the values (you hard-coded them) or that you manually encode them using the AntiXss Encoder class.
No, it is not necessary to encode all user input. If you want to avoid script injection, you may either validate the fields for special characters like '<', '>', '/', etc., or let the Html helper method do the encoding for you.
I am writing an iOS app that downloads some data from a server that's not under my control. I am not using custom data detectors. The strings in the returned JSON still contain their HTML url tags, and I want to remove them because I want to display the strings in a UITextView. The strings look like this:
<strong>Instagram</strong> / <strong>Behance</strong>
<strong>Live Now</strong>
What I really want is this:
Instagram Behance
Live Now
What is the best way to go about this?
Should I strip the url tags from the text using regex?
Would I lose the link "descriptions" (in the above example, "Instagram" and "Behance") when I do that?
Would this be way easier using a UIWebView?
If this would be too hard/impossible, it'd be okay to only have the urls, without their descriptions.
Thank you!
Should I strip the url tags from the text using regex?
No. HTML is too complex to be properly parsed using a RegEx. You'll need an XML parser.
Would I lose the link "descriptions" (in the above example, "Instagram" and "Behance") when I do that?
You wouldn't have to if you use an XML parser. With a RegEx you might, especially if you can't control exactly what's returned.
Would this be way easier using a UIWebView?
Yep. That's what I would do, unless you have a good reason not to.
I have RTF files containing that sort of content:
long_text_description_1 number1a number1b number1c
long_text_description_2 number2a number2b number2c
long_text_description_3 number3c
long_text_description_4 number4a number4b number4c
…
I need to extract the plain raw text without the colours, fonts and other formatting.
The only thing I need to keep is the most basic row/column information; ideally I would like a CSV file.
The file I get contains all the formatting:
{\cs18\lang1033\langfe1033\f0\b\i0\ul0\strike0\scaps0\fs15\afs15\charscalex100\expndtw0\cf1\dn0 number1a}
What is the best way to remove all rtf information while only keeping the row information?
Trying to work out a large set of regular expressions myself sounds dangerous without a complete understanding of the RTF format.
What I could find on the Internet mostly focused on using Windows languages & libraries unavailable in iOS.
All rtf tags are in the form \xxx.
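Try using a regular expression like \\\S+ (a literal backslash followed by one or more non-whitespace characters; inside a quoted string literal each backslash usually has to be doubled again) and remove all matches, i.e. replace them with nothing.
For your example, you'll end up with { number1a}. This removes every backslash together with the non-whitespace characters that follow it.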
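Here is a minimal sketch of that idea in Ruby, chosen only for illustration (the question targets iOS, where the same regex should carry over to the platform's regex API). It is not a full RTF parser: escaped characters such as \{ or \'hh, and control words that sit directly against a closing brace, are not handled. The variable names are made up.

```ruby
# The sample group from the question, in a single-quoted string so the
# backslashes stay literal.
rtf = '{\cs18\lang1033\langfe1033\f0\b\i0\ul0\strike0\scaps0\fs15' \
      '\afs15\charscalex100\expndtw0\cf1\dn0 number1a}'

# Remove each backslash plus the run of non-whitespace that follows it.
without_controls = rtf.gsub(/\\\S+/, "")

# Drop the group braces and surrounding whitespace to get the cell text.
text = without_controls.delete("{}").strip

puts without_controls
puts text
```

Cell values extracted this way could then be joined per row with commas to build the CSV, assuming the row boundaries can be recognised in the file.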
I am getting text from a feed that has a lot of characters like:
Insignia&trade; 2.0 Stereo Computer Speaker System (2-Piece) - Black
4th-Generation Apple&reg; iPod&reg; touch
Is there an easy way to get rid of these, or do I have to anticipate which characters I want to delete and use the delete method to remove them? Also, when I try to remove
&amp;
with
str.delete("&")
It leaves behind "amp;" Is there a better way to delete this type of character? Do I need to re-encode the text?
String#delete is certainly not what you want, as it works on characters, not the string as a whole.
Try
str.gsub /&amp;/, ""
You may also want to try replacing the &amp; with a literal ampersand, such as:
str.gsub /&amp;/, "&"
If this is closer to what you really want, you may get the best results unescaping the HTML string. If so try this:
CGI::unescapeHTML(str)
Details of the unescapeHTML method are here.
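A short sketch of what unescaping looks like in practice. The sample string is made up. As far as I know, `CGI.unescapeHTML` decodes only the basic named entities (&amp;, &lt;, &gt;, &quot;, &apos;) and numeric references, so a named entity such as &trade; is left as-is and needs a fuller HTML or XML parser.

```ruby
require "cgi"

# Hypothetical feed text mixing a named entity CGI does not know (&trade;),
# a basic one (&amp;) and a numeric reference (&#174; is the registered sign).
str = "Insignia&trade; 2.0 &amp; 4th-Generation Apple&#174; iPod"

decoded = CGI.unescapeHTML(str)

puts decoded
```

Compared with deleting characters one by one, this keeps the ampersand the feed author meant and avoids leftovers like "amp;".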
If you are getting data from a 'feed', aka RSS XML, then you should be using an XML parser like Nokogiri to process the XML. This will automatically unescape HTML entities and allow you to get the proper string representation directly.
To remove it, try the gsub method, something like this:
text = "foo&amp;bar"
text.gsub /\b&amp;\b/, "" #=> foobar