Text Filter localization - BlackBerry

If I create a text filter in the following manner:

class AlphaTextFilter extends TextFilter {
    public char convert(char c, int status) {
        if (!validate(c))
            return 0;
        return c;
    }

    public boolean validate(char c) {
        return ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'));
    }
}

will it work if the code is run on a device set to a language other than English? How can I make it localized?

What about using something like CharacterUtilities.isLetter() to check if it's a letter?

That depends on what your definition of "work" is. If by "work" you mean accepts only [a-zA-Z], then yes.
If you mean would it meet the expectations of international users, who expect to be using alphabetic, syllabic or ideographic characters (or "letters" in common usage), then no. Most platforms now offer some mechanism for identifying the Unicode character class, and most newer regex engines support matching based on Unicode character classes. In your case, \p{letter} (written \p{L} in many engines) would appear to be the most appropriate character class.
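As a sketch of that approach, following the structure of the filter in the question (assumption: the classification method used in validate() actually recognises non-Latin letters on the target device; CLDC's Character.isLetter may only cover Latin-1, in which case the CharacterUtilities.isLetter suggested above may be the better choice on BlackBerry):

class LetterTextFilter extends TextFilter {
    public char convert(char c, int status) {
        if (!validate(c))
            return 0; // reject the character, as in the original filter
        return c;
    }

    public boolean validate(char c) {
        // Accept any Unicode letter, not just [a-zA-Z]
        return Character.isLetter(c);
    }
}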

Yes, I believe so.
A char is a char no matter what locale you're in. It's all Unicode anyway, so there shouldn't be any sort of localisation issues.

Related

Can't identify the invalid identifier in my FLEX code

I am trying to build my own compiler which outputs the type of input the user gives, for example, abcd is an identifier, and 1242 is an integer. I have implemented it as below:
testProg.l
%{
#define IDENTIFIER 10
#define INTEGER 11
%}

IDENTIFIER [a-zA-Z_][a-zA-Z0-9_]*
INTEGER    [1-9][0-9]*|"0"

%%
{IDENTIFIER} { return IDENTIFIER; }
{INTEGER}    { return INTEGER; }
%%

int main() {
    int token;
    while ((token = yylex()) != 0) {
        if (token == IDENTIFIER) { printf("IDENTIFIER"); }
        else if (token == INTEGER) { printf("INTEGER"); }
        else { printf("INVALID"); }
    }
}
This works perfectly when I run the following commands:
flex testProg.l
cc lex.yy.c -lfl
./a.out
Sample working input
sample
IDENTIFIER
1993
INTEGER
The problem arises when I try to input an invalid token, for example 12abc. This is neither an integer nor an identifier and should output "INVALID" but it outputs:
12abc
INTEGER
IDENTIFIER
What happened is that 12 and abc are taken as separate tokens instead of one. How can I avoid this?
Many languages use lexical analysers which are perfectly happy to let 12abc be an integer followed by an identifier. Why not? If that means something in the language, then that's probably what the user meant. If it doesn't mean anything, it will trigger a syntax error, so the user will be informed.
But, OK, you want to recognise that as an error. In that case you need to recognise the erroneous input as an error, and the first step is to recognise it as a token. That's easy if you remember flex's match precedences:
[[:alpha:]_][[:alnum:]_]*  { return IDENTIFIER; }
[1-9][[:digit:]]*|0        { return NUMBER; }
[[:alnum:]_]+              { return BADTOKEN; }
Note that I replaced your macros with actual patterns, using named character classes for readability, and removed the redundant quotes on "0".
Flex parses 12abc as two separate tokens because you didn't tell it it shouldn't.
Lex derivatives, like Flex, work by one very simple but effective algorithm:
They start at the position where the last token ended (or at the beginning of the text) and try to find a rule that matches the most characters from this point. (If there are multiple rules that match the same number of characters, the one defined first in the "*.l" file is chosen.)
That's it. Notice there is nothing about it having to match a whole word.
That's actually a good thing. It is why in most programming languages you don't need to explicitly separate tokens. You can write things like (2+30L)/2 and the lexer for that language will figure out where each token ends, without additional hints like whitespaces. (The tokens would be (, 2, +, 30, L, ), / and 2.)
If you want to disable this fancy mechanism for the specific case of putting numbers and identifiers together, you will need to create a rule that explicitly forbids it, e.g.:

{IDENTIFIER}  { return IDENTIFIER; }
{INTEGER}     { return INTEGER; }
[0-9A-Za-z_]+ { return ERROR; }

Notice that this new rule also matches valid identifiers and integers. However, it won't be used for them because it is below them in the rules list. (Remember to #define ERROR with a value distinct from IDENTIFIER and INTEGER, so that the else branch in your main() reports it as INVALID.)

How and what does Firefox (Hunspell) do to clean text before spellchecking words?

I am trying to clean text in the exact way that Firefox does before spell checking individual words for a Firefox extension I'm building (my addon uses nspell, a JavaScript implementation of Hunspell, since Firefox doesn't expose the Hunspell instance it uses via the extension API).
I've looked at the cloned Firefox Gecko codebase, e.g. in the mozSpellChecker.h file and other related files found by searching for "spellcheck", but I cannot seem to find out how they are cleaning text.
Reverse engineering it has been a major PITA, I have this so far:
// cleans text and strips out unwanted symbols/patterns before we use it
// returns an empty string if content undefined
function cleanText (content, filter = true) {
  if (!content) {
    console.warn(`MultiDict: cannot clean falsy or undefined content: "${content}"`)
    return ''
  }

  // ToDo: first split string by spaces in order to properly ignore urls
  const rxUrls = /^(http|https|ftp|www)/
  const rxSeparators = /[\s\r\n.,:;!?_<>{}()[\]"`´^$°§½¼³%&¬+=*~#|/\\]/
  const rxSingleQuotes = /^'+|'+$/g

  // split all content by any character that should not form part of a word
  return content.split(rxSeparators)
    .reduce((acc, string) => {
      // remove any number of single quotes that do not form part of a word i.e. 'y'all' > y'all
      string = string.replace(rxSingleQuotes, '')
      // we never want empty strings, so skip them
      if (string.length < 1) {
        return acc
      }
      // for when we're just cleaning the text of punctuation (i.e. not filtering out emails, etc)
      if (!filter) {
        return acc.concat([string])
      }
      // filter out emails, URLs, numbers, and strings less than 2 characters in length
      if (!string.includes('#') && !rxUrls.test(string) && isNaN(string) && string.length > 1) {
        return acc.concat([string])
      }
      return acc
    }, [])
}
But I'm still seeing big differences between content when testing things like - well - the text area used to create this question.
To be clear: I'm looking for the exact method(s) and matches and rules that Firefox uses to clean text, and since it's open source it should be somewhere, but I can't seem to find it!
I believe you want the functions in mozInlineSpellWordUtil.cpp.
From the header:
/**
* This class extracts text from the DOM and builds it into a single string.
* The string includes whitespace breaks wherever non-inline elements begin
* and end. This string is broken into "real words", following somewhat
* complex rules; for example substrings that look like URLs or
* email addresses are treated as single words, but otherwise many kinds of
* punctuation are treated as word separators. GetNextWord provides a way
* to iterate over these "real words".
*
* The basic operation is:
*
* 1. Call Init with the weak pointer to the editor that you're using.
* 2. Call SetPositionAndEnd to initialize the current position inside the
* previously given range and set where you want to stop spellchecking.
* We'll stop at the word boundary after that. If SetEnd is not called,
* we'll stop at the end of the root element.
* 3. Call GetNextWord over and over until it returns false.
*/
You can find the complete source here, but it is fairly complex. For example, here is the method used to classify parts of the text as email addresses or urls, but it's over 50 lines long just to handle that.
Writing a spell checker seems trivial in principle, but as you can see from the source, it is a major endeavor. I'm not saying you shouldn't try, but as you've likely discovered, the devil is in the details of the edge cases.
Just as one example, when you're deciding what constitutes a word boundary or not, you have to decide which characters to ignore, including characters outside of the ASCII range. For example, here you can see the MONGOLIAN TODO SOFT HYPHEN being handled the same way as the Latin-1 SOFT HYPHEN:
// IsIgnorableCharacter
//
// These characters are ones that we should ignore in input.
inline bool IsIgnorableCharacter(char ch) {
    return (ch == static_cast<char>(0xAD));  // SOFT HYPHEN
}

inline bool IsIgnorableCharacter(char16_t ch) {
    return (ch == 0xAD ||    // SOFT HYPHEN
            ch == 0x1806);   // MONGOLIAN TODO SOFT HYPHEN
}
Again, I'm not trying to dissuade you from working on this project, but tokenizing text into discrete words in a way that will work within the context of HTML and in a multilingual environment is a major endeavor.

Objective-C some special char uncontrollably changing

I have a string that includes some special characters (like é, â, î, ı, etc.). When I use substring on this string, I encounter inconsistent results: some special characters change uncontrollably.
You are assuming that these are all characters:
[newword substringWithRange:NSMakeRange(0,1)];
[newword substringWithRange:NSMakeRange(1,1)];
[newword substringWithRange:NSMakeRange(2,1)];
[newword substringWithRange:NSMakeRange(3,1)];
// and so on...
In other words, you believe that:
A location always falls at the start of a character.
A character always has length 1.
Both assumptions are wrong. Please read the Characters and Grapheme Clusters chapter of Apple's String Programming Guide (here).
Your é happens to have length 2, because it is a base letter e followed by a combining diacritical accent. If you want it to have length 1, you need to normalize the string before you use it. Call precomposedStringWithCanonicalMapping and use the resulting string.
Example and proof (in Swift, but it won't matter, as I use NSString throughout):
let s = "é,â,î,ı" as NSString
let c = s.substring(with: NSRange(location: 0, length: 1)) // e
let s2 = s.precomposedStringWithCanonicalMapping as NSString
let c2 = s2.substring(with: NSRange(location: 0, length: 1)) // é
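The same pitfall and the same fix exist on other platforms. For comparison, a sketch in Java, where Normalizer.Form.NFC plays the role of precomposedStringWithCanonicalMapping:

import java.text.Normalizer;

public class PrecomposeDemo {
    public static void main(String[] args) {
        String decomposed = "e\u0301"; // 'e' followed by COMBINING ACUTE ACCENT
        System.out.println(decomposed.length()); // 2: one grapheme, two chars

        // NFC composes base letter + combining mark into a single code point
        String composed = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(composed.length());   // 1
        System.out.println(composed);            // é
    }
}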
You're treating a Unicode string like a sequence of bytes. Outside the low (ASCII) range, UTF-8 code points can span multiple bytes, so you are changing the text by stripping out the parts responsible for the accent above the letter, like this one: https://www.compart.com/en/unicode/U+0301

UTF-8 is variable width, so by treating it as raw bytes you may get weird results. I would suggest using something that is more aware of Unicode, like ICU (International Components for Unicode).

Now imagine you have a two-byte sequence like this:

0x65  0x00
e     NUL

Now you have a UTF-8 string with one code point and a null terminator. Now say you want to add an accent to that e. How would you do that? You could use a special Unicode code point to modify the e, so now the string is:

0x65  0xCC  0x81  0x00
e     U+0301      NUL

where 0xCC 0x81 is the two-byte UTF-8 encoding of U+0301 (COMBINING ACUTE ACCENT), which makes the e accented.

Edit: This answer assumes UTF-8 encoding, which is likely a bad assumption, but I think the answer, whether it's UTF-8, UTF-16, or any other encoding with combining characters, illustrates why you may have mysteriously disappearing accents. While this may be UTF-16, for the sake of simplicity let's pretend we live in a world where life is just slightly better because everyone only uses UTF-8 and UTF-16 doesn't exist.
To address the comment (this is less to do with the question, but it's some fun trivia about the NS/CF/Swift runtimes, bridging, constant CF strings, and other fun stuff like that): the representation of the actual string in memory is implementation defined and can vary, even for constant strings (trust me, I know; I fixed the ELF implementation of them in Clang for CoreFoundation a few days ago). Anyway, here's some code:
CF_INLINE CFStringEncoding __CFStringGetSystemEncoding(void) {
    if (__CFDefaultSystemEncoding == kCFStringEncodingInvalidId) (void)CFStringGetSystemEncoding();
    return __CFDefaultSystemEncoding;
}

CFStringEncoding CFStringFileSystemEncoding(void) {
    if (__CFDefaultFileSystemEncoding == kCFStringEncodingInvalidId) {
#if DEPLOYMENT_TARGET_MACOSX || DEPLOYMENT_TARGET_EMBEDDED || DEPLOYMENT_TARGET_EMBEDDED_MINI || DEPLOYMENT_TARGET_WINDOWS
        __CFDefaultFileSystemEncoding = kCFStringEncodingUTF8;
#else
        __CFDefaultFileSystemEncoding = CFStringGetSystemEncoding();
#endif
    }
    return __CFDefaultFileSystemEncoding;
}
This pattern appears throughout CoreFoundation/Foundation/SwiftFoundation. (Yes, you never know which sort of NSString you're actually holding; they usually pretend to be the same thing, but under the hood, depending on how you got the object, you may be holding one of three variations of it.)
This is why code like this exists, because NS/CF(Constant)/Swift strings have implementation defined internal representation.
if (((encoding & 0x0FFF) == kCFStringEncodingUnicode) && ((encoding == kCFStringEncodingUnicode) || ((encoding > kCFStringEncodingUTF8) && (encoding <= kCFStringEncodingUTF32LE)))) {
If you want consistent behavior you have to encode the string using a specific fixed encoding instead of relying on the internal representation.

PDFBox 2.0: Overcoming dictionary key encoding

I am extracting text from PDF forms with Apache PDFBox 2.0.1, reading the details of AcroForm fields. From a radio button field I dig up the appearance dictionary. I'm interested in the /N and /D entries (normal and "down" appearance). Like this (interactive BeanShell):
field = form.getField(fieldName);
widgets = field.getWidgets();
print("Field Name: " + field.getPartialName() + " (" + widgets.size() + ")");
for (annot : widgets) {
    ap = annot.getAppearance();
    keys = ap.getCOSObject().getDictionaryObject("N").keySet();
    keyList = new ArrayList(keys.size());
    for (cosKey : keys) { keyList.add(cosKey.getName()); }
    print(String.join("|", keyList));
}
The output is
Field Name: Krematorier (6)
Off|Skogskrem
Off|R�cksta
Off|Silverdal
Off|Stork�llan
Off|St Botvid
Nyn�shamn|Off
The question mark blotches should be Swedish characters "ä" or "å". Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
This sample PDF form can be downloaded here: http://www.stockholm.se/PageFiles/85478/KYF%20211%20Best%C3%A4llning%202014.pdf
Using iText RUPS I can see that the dictionary keys are encoded with ISO-8859-1 while PDFBox assumes they are Unicode, I guess.
Is there any way of decoding the keys using ISO-8859-1? Or any other way to retrieve the keys correctly?
Changing the assumed encoding
PDFBox' interpretation of the encoding of bytes in names (only names can be used as dictionary keys in PDFs) takes place in BaseParser.parseCOSName() when reading the name from the source PDF:
/**
 * This will parse a PDF name from the stream.
 *
 * @return The parsed PDF name.
 * @throws IOException If there is an error reading from the stream.
 */
protected COSName parseCOSName() throws IOException
{
    readExpectedChar('/');
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    int c = seqSource.read();
    while (c != -1)
    {
        int ch = c;
        if (ch == '#')
        {
            int ch1 = seqSource.read();
            int ch2 = seqSource.read();
            if (isHexDigit((char)ch1) && isHexDigit((char)ch2))
            {
                String hex = "" + (char)ch1 + (char)ch2;
                try
                {
                    buffer.write(Integer.parseInt(hex, 16));
                }
                catch (NumberFormatException e)
                {
                    throw new IOException("Error: expected hex digit, actual='" + hex + "'", e);
                }
                c = seqSource.read();
            }
            else
            {
                // check for premature EOF
                if (ch2 == -1 || ch1 == -1)
                {
                    LOG.error("Premature EOF in BaseParser#parseCOSName");
                    c = -1;
                    break;
                }
                seqSource.unread(ch2);
                c = ch1;
                buffer.write(ch);
            }
        }
        else if (isEndOfName(ch))
        {
            break;
        }
        else
        {
            buffer.write(ch);
            c = seqSource.read();
        }
    }
    if (c != -1)
    {
        seqSource.unread(c);
    }
    String string = new String(buffer.toByteArray(), Charsets.UTF_8);
    return COSName.getPDFName(string);
}
As you can see, after reading the name bytes and interpreting the # escape sequences, PDFBox unconditionally interprets the resulting bytes as UTF-8 encoded. To change this, therefore, you have to patch this PDFBox class and replace the charset named at the bottom.
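For the document at hand, the patch would amount to changing the final decoding step, e.g. like this (a sketch only: it assumes every name in the PDFs you parse is ISO-8859-1 encoded, and it changes name parsing globally):

// Patched ending of BaseParser.parseCOSName(): interpret the collected
// name bytes as ISO-8859-1 instead of UTF-8. Assumption: the documents
// being parsed encode their names this way, as the sample form does.
String string = new String(buffer.toByteArray(), Charsets.ISO_8859_1);
return COSName.getPDFName(string);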
Is PDFBox correct here?
According to the specification, when treating a name object as text
the sequence of bytes (after expansion of NUMBER SIGN sequences, if any) should be interpreted according to UTF-8, a variable-length byte-encoded representation of Unicode in which the printable ASCII characters have the same representations as in ASCII.
(section 7.3.5 Name Objects, ISO 32000-1)
BaseParser.parseCOSName() implements just that.
PDFBox' implementation is not completely correct, though, as the very act of interpreting the name as a string without need is already wrong:
name objects shall be treated as atomic within a PDF file. Ordinarily, the bytes making up the name are never treated as text to be presented to a human user or to an application external to a conforming reader. However, occasionally the need arises to treat a name object as text
Thus, PDF libraries should handle names as byte arrays as long as possible and only find a string representation when it is explicitly required, and only then the recommendation above (to assume UTF-8) should play a role. The specification even indicates where this may cause trouble:
PDF does not prescribe what UTF-8 sequence to choose for representing any given piece of externally specified text as a name object. In some cases, multiple UTF-8 sequences may represent the same logical text. Name objects defined by different sequences of bytes constitute distinct name objects in PDF, even though the UTF-8 sequences may have identical external interpretations.
Another situation becomes apparent in the document at hand: if the sequence of bytes constitutes no valid UTF-8, it still is a valid name. But such names are changed by the method above; any unparsable byte or subsequence is replaced by the Unicode Replacement Character '�'. Thus, different names may collapse into a single one.
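This collapsing is easy to demonstrate in plain Java (the byte values are hypothetical name bytes, chosen to mirror the Swedish characters in the sample form):

import java.nio.charset.StandardCharsets;

public class NameCollapseDemo {
    public static void main(String[] args) {
        // 0xE4 ('ä') and 0xE5 ('å') in ISO-8859-1; neither is valid UTF-8
        byte[] name1 = { 'R', (byte) 0xE4 };
        byte[] name2 = { 'R', (byte) 0xE5 };

        // Decoding as UTF-8 replaces each invalid sequence with U+FFFD ...
        String s1 = new String(name1, StandardCharsets.UTF_8); // "R\uFFFD"
        String s2 = new String(name2, StandardCharsets.UTF_8); // "R\uFFFD"

        // ... so two distinct PDF names collapse into a single string
        System.out.println(s1.equals(s2)); // true
    }
}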
Another issue is that when writing back a PDF, PDFBox is not acting symmetrically but instead interprets the String representation of the name (which has been retrieved as a UTF-8 interpretation if read from a PDF) using pure US_ASCII, cf. COSName.writePDF(OutputStream):
public void writePDF(OutputStream output) throws IOException
{
    output.write('/');
    byte[] bytes = getName().getBytes(Charsets.US_ASCII);
    for (byte b : bytes)
    {
        int current = (b + 256) % 256;
        // be more restrictive than the PDF spec, "Name Objects", see PDFBOX-2073
        if (current >= 'A' && current <= 'Z' ||
            current >= 'a' && current <= 'z' ||
            current >= '0' && current <= '9' ||
            current == '+' ||
            current == '-' ||
            current == '_' ||
            current == '#' ||
            current == '*' ||
            current == '$' ||
            current == ';' ||
            current == '.')
        {
            output.write(current);
        }
        else
        {
            output.write('#');
            output.write(String.format("%02X", current).getBytes(Charsets.US_ASCII));
        }
    }
}
Thus, any interesting Unicode character is replaced with the US_ASCII default replacement character, which I assume to be '?'.
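That assumption is easy to verify in plain Java (the sample value is made up):

import java.nio.charset.StandardCharsets;

public class AsciiReplacementDemo {
    public static void main(String[] args) {
        // getBytes(US_ASCII) replaces each unmappable character with '?'
        byte[] bytes = "Räcksta".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(bytes, StandardCharsets.US_ASCII)); // R?cksta
    }
}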
So it is quite fortunate that PDF names most often do merely contain ASCII characters... ;)
Historically
According to the implementation notes from the PDF 1.4 reference,
In Acrobat 4.0 and earlier versions, a name object being treated as text will typically be interpreted in a host platform encoding, which depends on the operating system and the local language. For Asian languages, this encoding may be something like Shift-JIS or Big Five. Consequently, it will be necessary to distinguish between names encoded this way and ones encoded as UTF-8. Fortunately, UTF-8 encoding is very stylized and its use can usually be recognized. A name that is found not to conform to UTF-8 encoding rules can instead be interpreted according to host platform encoding.
Thus, the sample document at hand seems to follow conventions from Acrobat 4, i.e. from the last century.
Source code excerpts are from PDFBox 2.0.0 but at first glance do not seem to have been changed in 2.0.1 or the development trunk.

Regular Expressions in iOS [duplicate]

I'm creating a regexp for password validation to be used in a Java application as a configuration parameter.
The regexp is:
^.*(?=.{8,})(?=..*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=]).*$
The password policy is:
At least 8 chars
Contains at least one digit
Contains at least one lower alpha char and one upper alpha char
Contains at least one char within a set of special chars (@#%$^ etc.)
Does not contain space, tab, etc.
I’m missing just point 5. I'm not able to have the regexp check for space, tab, carriage return, etc.
Could anyone help me?
Try this:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
Explanation:
^                   # start-of-string
(?=.*[0-9])         # a digit must occur at least once
(?=.*[a-z])         # a lower case letter must occur at least once
(?=.*[A-Z])         # an upper case letter must occur at least once
(?=.*[@#$%^&+=])    # a special character must occur at least once
(?=\S+$)            # no whitespace allowed in the entire string
.{8,}               # anything, at least eight places though
$                   # end-of-string
It's easy to add, modify or remove individual rules, since every rule is an independent "module".
The (?=.*[xyz]) construct eats the entire string (.*) and backtracks to the first occurrence where [xyz] can match. It succeeds if [xyz] is found, it fails otherwise.
The alternative would be using a reluctant quantifier: (?=.*?[xyz]). For a password check, this will hardly make any difference; for much longer strings it could be the more efficient variant.
The most efficient variant (but hardest to read and maintain, and therefore the most error-prone) would be (?=[^xyz]*[xyz]), of course. For a regex of this length and purpose, I would recommend against doing it that way, as it has no real benefits.
A simple example using regex:

public class PasswordValidation {
    public static void main(String[] args) {
        String passwd = "aaZZa44@";
        String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}";
        System.out.println(passwd.matches(pattern));
    }
}
Explanations:
(?=.*[0-9]) a digit must occur at least once
(?=.*[a-z]) a lower case letter must occur at least once
(?=.*[A-Z]) an upper case letter must occur at least once
(?=.*[@#$%^&+=]) a special character must occur at least once
(?=\\S+$) no whitespace allowed in the entire string
.{8,} at least 8 characters
All the previously given answers use the same (correct) technique to use a separate lookahead for each requirement. But they contain a couple of inefficiencies and a potentially massive bug, depending on the back end that will actually use the password.
I'll start with the regex from the accepted answer:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}$
First of all, since Java supports \A and \z, I prefer to use those to make sure the entire string is validated, independently of Pattern.MULTILINE. This doesn't affect performance, but avoids mistakes when regexes are recycled.
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\S+$).{8,}\z
Checking that the password does not contain whitespace and checking its minimum length can be done in a single pass by putting the variable quantifier {8,} on the shorthand \S that limits the allowed characters:
\A(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])\S{8,}\z
If the provided password does contain a space, all the checks will be done, only to have the final check fail on the space. This can be avoided by replacing all the dots with \S:
\A(?=\S*[0-9])(?=\S*[a-z])(?=\S*[A-Z])(?=\S*[@#$%^&+=])\S{8,}\z
The dot should only be used if you really want to allow any character. Otherwise, use a (negated) character class to limit your regex to only those characters that are really permitted. Though it makes little difference in this case, not using the dot when something else is more appropriate is a very good habit. I see far too many cases of catastrophic backtracking because the developer was too lazy to use something more appropriate than the dot.
Since there's a good chance the initial tests will find an appropriate character in the first half of the password, a lazy quantifier can be more efficient:
\A(?=\S*?[0-9])(?=\S*?[a-z])(?=\S*?[A-Z])(?=\S*?[@#$%^&+=])\S{8,}\z
But now for the really important issue: none of the answers mentions the fact that the original question seems to be written by somebody who thinks in ASCII. But in Java strings are Unicode. Are non-ASCII characters allowed in passwords? If they are, are only ASCII spaces disallowed, or should all Unicode whitespace be excluded?
By default \s matches only ASCII whitespace, so its inverse \S matches all non-whitespace ASCII characters as well as all non-ASCII characters, whitespace or not. If Unicode characters are allowed but Unicode spaces are not, the UNICODE_CHARACTER_CLASS flag can be specified to make \S exclude Unicode whitespace. If Unicode characters are not allowed, then [\x21-\x7E] can be used instead of \S to match all ASCII characters that are not a space or a control character.
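As a sketch, here is the flag applied to the lazy variant above (the test values are made up):

import java.util.regex.Pattern;

public class UnicodePasswordCheck {
    public static void main(String[] args) {
        String regex = "\\A(?=\\S*?[0-9])(?=\\S*?[a-z])(?=\\S*?[A-Z])(?=\\S*?[@#$%^&+=])\\S{8,}\\z";
        // UNICODE_CHARACTER_CLASS gives \s (and therefore \S) Unicode semantics,
        // so Unicode whitespace such as U+00A0 NO-BREAK SPACE is rejected too
        Pattern p = Pattern.compile(regex, Pattern.UNICODE_CHARACTER_CLASS);
        System.out.println(p.matcher("paßW0rd@x").matches());       // true
        System.out.println(p.matcher("paßW0rd@\u00A0x").matches()); // false
    }
}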
Which brings us to the next potential issue: do we want to allow control characters? The first step in writing a proper regex is to exactly specify what you want to match and what you don't. The only 100% technically correct answer is that the password specification in the question is ambiguous because it does not state whether certain ranges of characters like control characters or non-ASCII characters are permitted or not.
You should not use overly complex Regex (if you can avoid them) because they are
hard to read (at least for everyone but yourself)
hard to extend
hard to debug
Although there might be a small performance overhead in using many small regular expressions, the points above outweigh it easily.
I would implement it like this:

static boolean matchesPolicy(String pwd) {
    if (pwd.length() < 8) return false;
    if (!Pattern.compile("[0-9]").matcher(pwd).find()) return false;
    if (!Pattern.compile("[a-z]").matcher(pwd).find()) return false;
    if (!Pattern.compile("[A-Z]").matcher(pwd).find()) return false;
    if (!Pattern.compile("[%#$^]").matcher(pwd).find()) return false;
    if (Pattern.compile("\\s").matcher(pwd).find()) return false;
    return true;
}
Thanks for all the answers; this one is based on all of them, but extends the set of special characters:
@SuppressWarnings({"regexp", "RegExpUnexpectedAnchor", "RegExpRedundantEscape"})
String PASSWORD_SPECIAL_CHARS = "@#$%^`<>&+=\"!ºª·#~%&'¿¡€,:;*/+-.=_\\[\\]\\(\\)\\|\\_\\?\\\\";
int PASSWORD_MIN_SIZE = 8;
String PASSWORD_REGEXP = "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[" + PASSWORD_SPECIAL_CHARS + "])(?=\\S+$).{" + PASSWORD_MIN_SIZE + ",}$";
Unit tested:
Password Requirement:
Password should be at least eight (8) characters in length where the system can support it.
Passwords must include characters from at least two (2) of these groupings: alpha, numeric, and special characters.
^.*(?=.{8,})(?=.*\d)(?=.*[a-zA-Z])|(?=.{8,})(?=.*\d)(?=.*[!@#$%^&])|(?=.{8,})(?=.*[a-zA-Z])(?=.*[!@#$%^&]).*$
I tested it and it works
For anyone interested in minimum requirements for each type of character, I would suggest making the following extension over Tomalak's accepted answer:
^(?=(.*[0-9]){%d,})(?=(.*[a-z]){%d,})(?=(.*[A-Z]){%d,})(?=(.*[^0-9a-zA-Z]){%d,})(?=\S+$).{%d,}$
Notice that this is a formatting string and not the final regex pattern. Just substitute %d with the minimum required occurrences for: digits, lowercase, uppercase, non-digit/character, and entire password (respectively). Maximum occurrences are unlikely (unless you want a max of 0, effectively rejecting any such characters) but those could be easily added as well. Notice the extra grouping around each type so that the min/max constraints allow for non-consecutive matches. This worked wonders for a system where we could centrally configure how many of each type of character we required and then have the website as well as two different mobile platforms fetch that information in order to construct the regex pattern based on the above formatting string.
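For example, the formatting string might be instantiated like this (the helper name and the values are hypothetical):

// Hypothetical helper: builds the pattern from centrally configured minimums.
static String buildPasswordRegex(int minDigits, int minLower, int minUpper,
                                 int minSpecial, int minLength) {
    return String.format(
        "^(?=(.*[0-9]){%d,})(?=(.*[a-z]){%d,})(?=(.*[A-Z]){%d,})"
            + "(?=(.*[^0-9a-zA-Z]){%d,})(?=\\S+$).{%d,}$",
        minDigits, minLower, minUpper, minSpecial, minLength);
}

// e.g. at least 1 digit, 1 lowercase, 1 uppercase, 1 special, 8 characters total:
String pattern = buildPasswordRegex(1, 1, 1, 1, 8);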
This one checks for every special character:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=\S+$).*[A-Za-z0-9].{8,}$
A Java method, ready for you, with parameters: just copy and paste it and set your desired parameters.
If you don't want a module, just comment it out or add an "if", as done here for the special characters.
//______________________________________________________________________________
/**
 * Password validation
 */
//______________________________________________________________________________
private static boolean validation_Password(final String PASSWORD_Arg) {
    boolean result = false;
    try {
        if (PASSWORD_Arg != null) {
            //_________________________
            // Parameters
            final String MIN_LENGTH = "8";
            final String MAX_LENGTH = "20";
            final boolean SPECIAL_CHAR_NEEDED = true;
            //_________________________
            // Modules
            final String ONE_DIGIT = "(?=.*[0-9])";   // a digit must occur at least once
            final String LOWER_CASE = "(?=.*[a-z])";  // a lower case letter must occur at least once
            final String UPPER_CASE = "(?=.*[A-Z])";  // an upper case letter must occur at least once
            final String NO_SPACE = "(?=\\S+$)";      // no whitespace allowed in the entire string
            //final String MIN_CHAR = ".{" + MIN_LENGTH + ",}"; // .{8,} at least 8 characters
            final String MIN_MAX_CHAR = ".{" + MIN_LENGTH + "," + MAX_LENGTH + "}"; // .{5,10} means minimum 5 and maximum 10 characters
            final String SPECIAL_CHAR;
            if (SPECIAL_CHAR_NEEDED) SPECIAL_CHAR = "(?=.*[@#$%^&+=])"; // a special character must occur at least once
            else SPECIAL_CHAR = "";
            //_________________________
            // Pattern
            //String pattern = "(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}";
            final String PATTERN = ONE_DIGIT + LOWER_CASE + UPPER_CASE + SPECIAL_CHAR + NO_SPACE + MIN_MAX_CHAR;
            //_________________________
            result = PASSWORD_Arg.matches(PATTERN);
            //_________________________
        }
    } catch (Exception ex) {
        result = false;
    }
    return result;
}
You can also do it like this:
public boolean isPasswordValid(String password) {
    String regExpn =
            "^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])(?=\\S+$).{8,}$";
    CharSequence inputStr = password;
    // Note: compiling with Pattern.CASE_INSENSITIVE would make the lower case
    // and upper case lookaheads interchangeable and defeat the policy.
    Pattern pattern = Pattern.compile(regExpn);
    Matcher matcher = pattern.matcher(inputStr);
    return matcher.matches();
}
Use the Passay library, which is a powerful API.
I think this can do it as well (as a simpler form):
^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=])[^\s]{8,}$
[Regex Demo]
An easy one:
"^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[\\W_])[\\S]{8,10}$"
(?= ... ) is a positive lookahead: it scans ahead over the whole input string and succeeds only if the condition inside can be matched. For example, (?=.*[0-9]) ensures that at least one digit occurs somewhere in the string; if it doesn't, the match fails.
(?! ... ) is the opposite, a negative lookahead: the match fails if the condition inside can be matched.
Roughly, the regex reads: ^(condition)(condition)(condition)(condition)[\S]{8,10}$
String s = pwd;
int n = 0;
for (int i = 0; i < s.length(); i++) {
    if (Character.isDigit(s.charAt(i))) {
        n = 5;
        break;
    }
}
for (int i = 0; i < s.length(); i++) {
    if (Character.isLetter(s.charAt(i))) {
        n += 5;
        break;
    }
}
if (n == 10) {
    out.print("Password format correct <b>Accepted</b><br>");
} else {
    out.print("Password must be alphanumeric <b>Declined</b><br>");
}
Explanation:
First take the password as a string and initialize an integer n to 0.
Then check each character with a for loop.
If a digit is found in the string, set n to 5 and break to the next for loop: Character.isDigit(s.charAt(i))
The second loop checks for any letter in the string. If one is found, add another 5 to n: Character.isLetter(s.charAt(i))
Finally, test the integer n with an if condition: if n == 10, the given string is alphanumeric; otherwise it is not.
Sample code block for a strong password:
(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[^a-zA-Z0-9])(?=\\S+$).{6,18}
at least 6 characters
up to 18 characters
at least one number
at least one lowercase letter
at least one uppercase letter
can contain all special characters
The RegEx is:
^(?:(?=.*\d)(?=.*[a-z])(?=.*[A-Z])(?=.*[@#$%^&+=]).*)[^\s]{8,}$
at least 8 characters: {8,}
at least one number: (?=.*\d)
at least one lowercase letter: (?=.*[a-z])
at least one uppercase letter: (?=.*[A-Z])
at least one special character: (?=.*[@#$%^&+=])
no spaces: [^\s]
A more general answer which accepts all the special characters including _ would be slightly different:
^(?=.*[0-9])(?=.*[a-z])(?=.*[A-Z])(?=.*[\W|\_])(?=\S+$).{8,}$
The difference (?=.*[\W|\_]) translates to "at least one of all the special characters including the underscore".
