ruby/rails detect financial track data and return nil/empty string - ruby-on-rails

I read through similar Stack Overflow questions to understand financial track card data.
I think the issue I am facing might be slightly different, or maybe I am just weak at regex.
We have a service that accidentally returns track data instead of the guest name.
My goal: every time I receive track data, I display "" (an empty string); otherwise I return the guest name. (This is a temporary solution until we fix the root cause.)
This is my regular expression, but it doesn't seem to detect track data:
irb(main):043:0> guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
irb(main):044:0> (/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/.match(guestname)) ? "" : guestname
=> "%4234242xx12^TEST/GUEST L ^324532635645744646462"
(Not real data)
Now, looking at the wiki for track data information I want to cover most cases, if not all:
https://en.wikipedia.org/wiki/Magnetic_stripe_card#Financial_cards
Could someone help with my regex? This is what I have:
/[(%[bB])(;)]\d{3,}.{9,}[(^.+^)(=)].+\?.{,2}/
Track 1, Format B:
Start sentinel — one character (generally '%')
Format code="B" — one character (alpha only)
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Field Separator — one character (generally '^')
Name — 2 to 26 characters
Field Separator — one character (generally '^')
Expiration date — four characters in the form YYMM.
Service code — three characters
Discretionary data — may include Pin Verification Key Indicator (PVKI,
1 character), PIN Verification Value (PVV, 4 characters), Card
Verification Value or Card Verification Code (CVV or CVC, 3
characters)
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track.
Track 2: This format was developed by the banking industry (ABA). This
track is written with a 5-bit scheme (4 data bits + 1 parity), which
allows for sixteen possible characters, which are the numbers 0-9,
plus the six characters : ; < = > ? . The selection of six
punctuation symbols may seem odd, but in fact the sixteen codes simply
map to the ASCII range 0x30 through 0x3f, which defines ten digit
characters plus those six symbols. The data format is as follows:
Start sentinel — one character (generally ';')
Primary account number (PAN) — up to 19 characters. Usually, but not
always, matches the credit card number printed on the front of the
card.
Separator — one char (generally '=')
Expiration date — four characters in the form YYMM.
Service code — three digits. The first digit specifies the interchange
rules, the second specifies authorisation processing and the third
specifies the range of services
Discretionary data — as in track one
End sentinel — one character (generally '?')
Longitudinal redundancy check (LRC) — it is one character and a
validity character calculated from other data on the track. Most
reader devices do not return this value when the card is swiped to the
presentation layer, and use it only to verify the input internally to
the reader.

Your example input string does not contain a format code after the start sentinel.
You are trying to parse an HTML-encoded version, which is weird.
So, I would start with HTML decoding, e.g. with Nokogiri:
▶ guestname="%4234242xx12^TEST/GUEST L ^324532635645744646462"
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
▶ parsed = Nokogiri::HTML.parse(guestname).text
#⇒ "%4234242xx12^TEST/GUEST L ^324532635645744646462"
OK, now we at least have a leading percent. Now let us ask ourselves: how many users have a guest name starting with a percent sign? I bet none. You might re-check yourself by running a query against your database. Since it is a temporary solution, I would definitely shut the perfectionism up and go with:
▶ parsed =~ /\A%/ ? '' : parsed
Hope it helps.
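If the percent-sign check turns out to be too coarse, the Track 1 and Track 2 layouts quoted in the question can be turned into slightly stricter prefix patterns. This is only a sketch under a few assumptions: the constant names and the display_guest_name helper are made up for illustration, the format code is treated as optional because the sample input lacks one, and x/X is accepted inside the PAN only because the sample is masked that way.

```ruby
# Hypothetical helper, not part of Rails or any gem.
# Track 1: '%', optional format letter, PAN of up to 19 chars, '^', name of 2-26 chars, '^'.
TRACK1_PREFIX = /\A%[A-Za-z]?[\dxX]{1,19}\^[^\^]{2,26}\^/
# Track 2: ';', PAN of up to 19 digits, '='.
TRACK2_PREFIX = /\A;\d{1,19}=/

# Returns '' when the value looks like track data, otherwise the (stripped) value.
def display_guest_name(raw)
  value = raw.to_s.strip
  return '' if value.match?(TRACK1_PREFIX) || value.match?(TRACK2_PREFIX)

  value
end

puts display_guest_name("%4234242xx12^TEST/GUEST L ^324532635645744646462").inspect
puts display_guest_name("Jane Doe").inspect
```

Only the start of the string is checked, so a value is blanked as soon as it begins like a track, even if the end sentinel is missing.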

Related

How to specify a range in Ruby

I've been looking for a good way to see if a string of items are all numbers, and thought there might be a way of specifying a range from 0 to 9 and seeing if they're included in the string, but all that I've looked up online has really confused me.
def validate_pin(pin)
(pin.length == 4 || pin.length == 6) && pin.count("0-9") == pin.length
end
The code above is someone else's work and I've been trying to identify how it works. It's a pin checker - takes in a set of characters and ensures the string is either 4 or 6 digits and all numbers - but how does the range work?
When I did this problem I tried is_a? Integer and a bunch of other things, including ranges such as (0..9), ("0..9") and ("0".."9"), to validate that a character is an integer. When I saw ("0-9") it confused the heck out of me, and half an hour of googling and YouTube has only left me with regex tutorials (which I'm interested in, but currently I'm just trying to get the basics down).
So to sum this up, my goal is to understand a more semantic/concise way to identify if a character is an integer. Whatever is the simplest way. All and any feedback is welcome. I am a new rubyist and trying to get down my fundamentals. Thank You.
Regex really is the right way to do this. It's specifically for testing patterns in strings. This is how you'd test "do all characters in this string fall in the range of characters 0-9?":
pin.match(/\A[0-9]+\z/)
This regex says "Does this string start and end with at least one of the characters 0-9, with nothing else in between?" - the \A and \z are start-of-string and end-of-string matchers, and the [0-9]+ matches any one or more of any character in that range.
You could even do your entire check in one line of regex:
pin.match(/\A([0-9]{4}|[0-9]{6})\z/)
Which says "Does this string consist of the characters 0-9 repeated exactly 4 times, or the characters 0-9, repeated exactly 6 times?"
Ruby's String#count method does something similar to this, though it just counts the number of occurrences of the characters passed, and it uses something similar to regex ranges to allow you to specify character ranges.
The sequence c1-c2 means all characters between c1 and c2.
Thus, it expands the parameter "0-9" into the list of characters "0123456789", and then it tests how many of the characters in the string match that list of characters.
This will work to verify that a certain number of numbers exist in the string, and the length checks let you implicitly test that no other characters exist in the string. However, regexes let you assert that directly, by ensuring that the whole string matches a given pattern, including length constraints.
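To make the comparison concrete, here is a small sketch that puts the String#count version and the regex version side by side. The method names pin_by_count? and pin_by_regex? are made up for this example.

```ruby
# Two checks for "exactly 4 or 6 ASCII digits".
def pin_by_count?(pin)
  # String#count treats "0-9" as the characters 0123456789 and counts the hits.
  [4, 6].include?(pin.length) && pin.count("0-9") == pin.length
end

def pin_by_regex?(pin)
  # \A and \z anchor the whole string, so no stray characters can slip through.
  pin.match?(/\A(?:[0-9]{4}|[0-9]{6})\z/)
end

%w[1234 123456 12345 12a4].each do |pin|
  puts "#{pin}: count=#{pin_by_count?(pin)} regex=#{pin_by_regex?(pin)}"
end
```

Both methods should agree on every input; the regex version simply states the length and character constraints in one place.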
Count everything non-digit in pin and check if this count is zero:
pin.count("^0-9").zero?
Since you seem to be looking for answers outside regex and since Chris already spelled out how the count method was being implemented in the example above, I'll try to add one more idea for testing whether a string is an Integer or not:
pin.to_i.to_s == pin
What we're doing is converting the string to an integer, converting that result back to a string, and then testing whether anything changed during the process. If the result is true, then nothing changed during the round trip and the string contains only an integer.
EDIT:
The example above only works if the entire string is an Integer and won’t properly deal with leading zeros. If you want to check to make sure each and every character is an Integer then do something like this instead:
("1" + pin).to_i.to_s[1..-1] == pin
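The round-trip idea can be sketched as two small methods. The names integer_string? and digits_only? are hypothetical, and this variant builds a new string with "1" + str rather than mutating the argument.

```ruby
# True when the string survives String -> Integer -> String unchanged.
# Leading zeros are lost in the round trip, so "0123" is rejected.
def integer_string?(str)
  str.to_i.to_s == str
end

# Guards the leading zeros with a temporary "1" prefix, then drops that
# prefix again with [1..-1] before comparing.
def digits_only?(str)
  ("1" + str).to_i.to_s[1..-1] == str
end

puts integer_string?("0123") # leading zero is lost in the round trip
puts digits_only?("0123")
puts digits_only?("12a4")
```

Note that digits_only? treats an empty string as valid, so a separate length check is still needed for the PIN case.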
Part of the question seems to be exactly HOW the following portion of code is doing its job:
pin.count("0-9")
This piece of the code is simply returning a count of how many instances of the numbers 0 through 9 exist in the string. That's only one piece of the relevant section of code though. You need to look at the rest of the line to make sense of it:
pin.count("0-9") == pin.length
The first part counts how many instances then the second part compares that to the length of the string. If they are equal (==) then that means every character in the string is an Integer.
Sometimes negation can be used to advantage:
!pin.match?(/\D/) && [4,6].include?(pin.length)
pin.match?(/\D/) returns true if the string contains a character other than a digit (matching /\D/), in which case it would be negated to false.
One advantage of using negation here is that if the string contains a character other than a digit pin.match?(/\D/) would return true as soon as a non-digit is found, as opposed to methods that examine all the characters in the string.

How to find which unicode letters look good in URL

For some examples:
These characters are too short or overlap the surrounding characters:
/b5/ີ/foo
/31/ั/foo
/39/᤹/foo
/a3/ᮣ/foo
These are too long to fit into monospace character slot:
/4b/ോ/foo
/23/ᠣ/fo
/61/ᡡ/foo
/86/ᢆ/foo
/ba/຺/foo
Then blank/whitespace/invisible characters would also be considered ones that don't fit well in the URL.
Wondering if there is a simple way to figure out which characters fall into these slots:
Fits well in URL (latin characters, chinese characters, etc.).
Too large for monospace (chinese characters, the above examples, etc.).
Combining character or overlaps surrounding URL characters (examples above).
Maybe by checking some property on the unicode character there is a way to tell this programmatically, so I don't need to go through each character individually and visually check which category it falls into.
Mainly I am looking for which characters either (a) need to be placed on another character (combining characters), or (b) need some extra padding, like the examples above, so you can see them in the URL.
The problem is ill-defined. You claim that the latter five don't fit, but for me they render in one column, which is precisely according to how it's specified in Unicode. Also see: https://stackoverflow.com/a/56216985/46395
use 5.030;
use Unicode::GCString qw();
for (
"\N{WORD JOINER}", # U+2060
"\N{LATIN SMALL LETTER L}", # U+006C
"\N{CJK UNIFIED IDEOGRAPH-4E2D}", # U+4E2D
"\N{LAO VOWEL SIGN II}", # U+0EB5
"\N{THAI CHARACTER MAI HAN-AKAT}", # U+0E31
"\N{LIMBU SIGN MUKPHRENG}", # U+1939
"\N{SUNDANESE CONSONANT SIGN PANYIKU}", # U+1BA3
"\N{MALAYALAM VOWEL SIGN OO}", # U+0D4B
"\N{MONGOLIAN LETTER O}", # U+1823
"\N{MONGOLIAN LETTER SIBE U}", # U+1861
"\N{MONGOLIAN LETTER ALI GALI THREE BALUDA}", # U+1886
"\N{LAO SIGN PALI VIRAMA}", # U+0EBA
) {
say Unicode::GCString->new($_)->columns
}
__END__
0
1
2
0
0
0
0
1
1
1
1
1
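As a rough complement to the column counts above, the Unicode general category can also be checked directly. The sketch below uses Ruby's built-in \p{...} regex properties; the url_char_class method and its three labels are made up for illustration, and the category is only a proxy that says nothing about the actual rendered width in a given font.

```ruby
# Classifies a single character by Unicode general category.
def url_char_class(ch)
  case ch
  when /\A[\p{Mn}\p{Me}]\z/     then :nonspacing_mark # drawn on top of a neighbour
  when /\A[\p{Cf}\p{Cc}\p{Z}]\z/ then :invisible       # format, control, separators
  else                                :spacing         # occupies its own cell(s)
  end
end

["l", "\u4E2D", "\u0E31", "\u2060"].each do |ch|
  puts format("U+%04X %s", ch.ord, url_char_class(ch))
end
```

Characters labelled :nonspacing_mark are the ones that would need a base character or padding; wide characters such as CJK ideographs still come out as :spacing, so a width check is a separate step.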

Ruby: Split a string into substring of maximum 40 characters

I have some strings containing a sentence, and I need to divide each one into substrings of at most 40 characters.
But I don't want to split the sentence in the middle of a word.
I tried the .gsub function; it returns at most 40 characters and avoids cutting the string in the middle of a word, but it returns only the first occurrence.
sentence[0..40].gsub(/\s\w+$/,'')
I tried with split, but I can only select the first 40 characters, and it splits in the middle of a word...
sentence.split(...){40}
My string is "Sure, we will show ourselves only when we know the east door has been opened.".
The string output i want is
["Sure, we will show ourselves only when we","know the east door has been opened."]
Do you have a solution? Thanks
Your first attempt:
sentence[0..40].gsub(/\s\w+$/,'')
almost works, but it has one fatal flaw. You are splitting on the number of characters before cutting off the last word. This means you have no way of knowing whether the bit being trimmed off was a whole word, or a partial word.
Because of this, your code will always cut off the last word.
I would solve the problem as follows:
sentence[/\A.{0,39}[a-z]\b/mi]
\A is an anchor to fix the regex to the start of the string.
.{0,39}[a-z] matches on 1 to 40 characters, where the last character must be a letter. This is to prevent the last selected character from being punctuation or space. (Is that desired behaviour? Your question didn't really specify. Feel free to tweak/remove that [a-z] part, e.g. [a-z.] to match a full stop, if desired.)
\b is a word boundary look-around. It is a zero-width matcher, on beginning/end of words.
The /mi modifiers make the match case-insensitive (so A-Z are included) and let . match newline characters as well.
One very minor note is that because this regex matches 1 to 40 characters (rather than zero), it is possible to get a nil result. (Although this is seemingly very unlikely, since you'd need a 1-word, 41+ letter string!!) To account for this edge case, call .to_s on the result if needed.
Update: Thank you for the improved edit to your question, providing a concrete example of an input/result. This makes it much clearer what you are asking for, as the original post was somewhat ambiguous.
You could solve this with something like the following:
sentence.scan(/.{0,39}[a-z.!?,;](?:\b|$)/mi)
String#scan returns an array of strings that match the pattern - so you can then re-join these strings to reconstruct the original.
Again, I have added a few more characters (!?,;) to the list of "final characters in the substring". Feel free to tweak this as desired.
(?:\b|$) means "either a word boundary, or the end of the line". This fixes the issue of the result not including the final . in the substrings. Note that I have used a non-capture group (?:) to prevent the result of scan from changing.
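For comparison, the same goal can be reached without a regex by building the lines word by word. This is a minimal sketch rather than a replacement for the scan approach: the wrap_words name is made up, whitespace between words is normalised to single spaces, and a single word longer than the limit is kept whole on its own line.

```ruby
# Greedy word wrap: keep appending words while the line stays within max.
def wrap_words(sentence, max = 40)
  sentence.split.each_with_object([]) do |word, lines|
    if lines.any? && lines.last.length + 1 + word.length <= max
      lines.last << " " << word # fits: extend the current line
    else
      lines << word.dup         # does not fit: start a new line
    end
  end
end

sentence = "Sure, we will show ourselves only when we know the east door has been opened."
p wrap_words(sentence)
# => ["Sure, we will show ourselves only when", "we know the east door has been opened."]
```

Unlike the scan result, the pieces carry no leading spaces, so they can be rejoined with a single space.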

How can we eliminate junk value in field?

I have some CSV records which are variable in length, for example:
0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85
0006786565,34567899,ZAMTR,!To ACC 26575443,DR,1000
I need to separate each of these fields, and I need the last field, which should be a money amount.
However, when I read the file and unstring the record into fields, I found that the last field contains junk values at the end. The amount (money) field should be 8 characters: 5 digits at the front, 1 dot, and 2 digits at the end. The values from the input could be anything, such as 13.5, 1000 and 354.23.
"FILE SECTION"
FD INPUT_FILE.
01 INPUT_REC PIC X(66).
"WORKING STORAGE SECTION"
01 WS_INPUT_REC PIC X(66).
01 WS_AMOUNT_NUM PIC 9(5).9(2).
01 WS_AMOUNT_TXT PIC X(8).
"MAIN SECTION"
UNSTRING INPUT_REC DELIMITED BY ","
INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT
MOVE WS_AMOUNT_TXT(1:8) TO WS_AMOUNT_NUM(1:8)
DISPLAY WS_AMOUNT_NUM
From the display, the values look normal (345.23, 1000), just as they are in the input. However, after I wrote the field to a file, here is what they became:
79.85^M^#^#
137.35^M^#
I have inspected the field WS_AMOUNT_NUM, which came from the field WS_AMOUNT_TXT, and found that ^# is a kind of LOW-VALUE. However, I cannot figure out what ^M is; it is not a space and not a HIGH-VALUE.
I am guessing, but it looks like you may be reading variable length records from a file into a fixed length COBOL record. The junk at the end of the COBOL record is giving you some grief. Hard to say how consistent that junk is going to be from one read to the next (data beyond the bounds of actual input record length are technically undefined). That junk ends up being included in WS_AMOUNT_TXT after the UNSTRING.
There are a number of ways to solve this problem. The suggestion I am giving you here may not be optimal, but it is simple and should get the job done.
The last INTO field, WS_AMOUNT_TXT, in your UNSTRING statement is the one that receives all of the trailing junk. That junk needs to be stripped off. Knowing that the only valid characters in the last field are digits and the decimal character, you could clean it up as follows:
PERFORM VARYING WS_I FROM LENGTH OF WS_AMOUNT_TXT BY -1
UNTIL WS_I = ZERO
IF WS_AMOUNT_TXT(WS_I:1) IS NUMERIC OR
WS_AMOUNT_TXT(WS_I:1) = '.'
MOVE 1 TO WS_I
ELSE
MOVE SPACE TO WS_AMOUNT_TXT(WS_I:1)
END-IF
END-PERFORM
The basic idea in the above code is to scan from the end of the last UNSTRING output field to the beginning, replacing anything that is not a valid digit or decimal point with a space. Once a valid digit/decimal is found, exit the loop on the assumption that the rest will be valid.
After cleanup use the intrinsic function NUMVAL as outlined in my answer to your previous question to convert WS_AMOUNT_TXT into a numeric data type.
One final piece of advice, MOVE SPACES TO INPUT_REC before each READ to blow away data left over from a previous read that might be left in the buffer. This will protect you when reading a very "short" record after a "long" one - otherwise you may trip over data left over from the previous read.
Hope this helps.
EDIT: Just noticed this answer to your question about reading variable length files. Using a variable length input record is a better approach. Given the actual input record length you can do something like:
UNSTRING INPUT_REC(1:REC_LEN) INTO...
Where REC_LEN is the variable specified after OCCURS DEPENDING ON for the INPUT_REC file FD. All the junk you are encountering occurs after the end of the record as defined by REC_LEN. Using reference modification as illustrated above trims it off before UNSTRING does its work to separate out the individual data fields.
EDIT 2:
Cannot use reference modification with UNSTRING. Darn... It is possible with some other COBOL dialects but not with OpenVMS COBOL. Try the following:
MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...
Where WS_BUFFER is a working storage PIC X variable long enough to hold the longest input record. When you MOVE a short alphanumeric field to a longer one, the destination field (i.e. WS_BUFFER) is left justified, with spaces used to pad the remaining space. Since leading and trailing spaces are acceptable to the NUMVAL function you have exactly what you need.
I have a reason for pushing you in this direction. Any junk that ends up at the trailing end of a record buffer when reading a short record is undefined. There is a possibility that some of that junk just might end up being a digit or a decimal point. Should this occur, the cleanup routine I originally suggested would fail.
EDIT 3:
There are no ^# in the resulting WS_AMOUNT_TXT, but still there are a ^M
Looks like the file system is treating <CR> (that ^M thing) at the end of each record as data. If the file you are reading came from a Windows platform and you are now reading it on a UNIX platform that would explain the problem. Under Windows records are terminated with <CR><LF> while on UNIX they are terminated with <LF> only. The UNIX file system treats <CR> as if it were part of the record.
If this is the case, you can be pretty sure that there will be a single <CR> at the end of every record read. There are a number of ways to deal with this:
Method 1: As you already noted, pre-edit the file using Notepad++ or some other tool to remove the <CR> characters before processing through your COBOL program. Personally I don't think this is the best way of going about it. I prefer to use a COBOL only solution since it involves fewer processing steps.
Method 2: Trim the last character from each input record before processing it. The last character should always be <CR>. Try the following if you are reading records as variable length and have the actual input record length available.
SUBTRACT 1 FROM REC_LEN
MOVE INPUT_REC(1:REC_LEN) TO WS_BUFFER
UNSTRING WS_BUFFER INTO...
Method 3: Treat <CR> as a delimiter when UNSTRINGing as follows:
UNSTRING INPUT_REC DELIMITED BY "," OR x"0D"
INTO WS_ID_1, WS_ID_2, WS_CODE, WS_DESCRIPTION, WS_FLAG, WS_AMOUNT_TXT
Method 4: Condition the last receiving field from UNSTRING by replacing trailing non-digit/non-decimal-point characters with spaces. I outlined this solution a little earlier in this question. You could also explore the INSPECT statement using the REPLACING option (Format 2). This should be able to do pretty much the same thing - just replace all x"00" by SPACE and x"0D" by SPACE.
Where there is a will, there is a way. Any of the above solutions should work for you. Choose the one you are most comfortable with.
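The cleanup idea behind Method 4 is language independent: take the last comma-separated field and drop whatever trails the final digit or decimal point. Purely as an illustration outside COBOL, here is a short Ruby sketch; the clean_amount name is made up, and the sample records append a carriage return and NUL bytes by hand to imitate the junk described in the question.

```ruby
# Returns the last CSV field with trailing non-digit, non-dot bytes removed.
def clean_amount(record)
  last_field = record.split(",").last.to_s
  last_field.sub(/[^0-9.]+\z/, "")
end

records = [
  "0005464560,45667759,ZAMTR,!To ACC 12345678,DR,79.85\r\0\0",
  "0006786565,34567899,ZAMTR,!To ACC 26575443,DR,1000\r"
]
records.each { |r| puts clean_amount(r) }
```

Like the COBOL loop, this assumes the junk never happens to end in a digit or a decimal point, which is why trimming by the real record length is the safer route.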
^M is a carriage return.
Would Google Refine be useful for rectifying this data?

Dynamically generate short URLs for a SQL database?

My client has a database of over 400,000 customers. Each customer is assigned a GUID. He wants me to select all the records and create a dynamic "short URL" which includes this GUID as a parameter, then save this short URL to a field on each client's record.
The first question I have is: do any of the URL shortening sites allow you to programmatically create short URLs on the fly like this?
TinyURL allows you to do it (not widely documented), for example:
http://tinyurl.com/api-create.php?url=http://www.stackoverflow.com/
becomes http://tinyurl.com/6fqmtu
So you could have
http://tinyurl.com/api-create.php?url=http://mysite.com/user/xxxx-xxxx-xxxx-xxxx
to http://tinyurl.com/64dva66.
The guid doesn't end up being that clear, but the URLs should be unique
Note that you'd have to pass this through an HTTPWebRequest and get the response.
You can use Google's URL shortener; they have an API.
Here are the docs for that: http://code.google.com/apis/urlshortener/v1/getting_started.html
Is this URL not sufficiently short?
http://www.clientsdomain.com/?customer=267E7DDD-8D01-4F38-A3D8-DCBAA2179609
NOTE: Personally I think your client is asking for something strange. By asking you to create a URL field on each customer record (which will be based on the Customer's GUID through a deterministic algorithm) he is in fact essentially asking you to denormalize the database.
The algorithm URL shortening sites use is very simple:
Store the URL and map it to its sequence number.
Convert the sequence number (id) to a fixed-length string.
Using just six lowercase letters for the second step will give you many more combinations (26^6, about 309 million) than the current application needs, and there's nothing preventing the use of a longer sequence at some point in time. You can use shorter sequences if you allow for numbers and/or uppercase letters.
The algorithm for the conversion is a base conversion (like when converting to hex), padding with whatever symbol represents zero. This is some Python code for the conversion:
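LOWER = [chr(x + ord('a')) for x in range(26)]   # 'a'..'z'
DIGITS = [chr(x + ord('0')) for x in range(10)]  # '0'..'9'
MAP = DIGITS + LOWER                             # 36 symbols in total

def i2text(i, l):
    """Convert the non-negative integer i to a base-36 string of length l."""
    n = len(MAP)
    result = ''
    while i != 0:
        c = i % n
        # Prepend, so the most significant digit ends up on the left.
        result = MAP[c] + result
        i //= n
    # Left-pad with the zero symbol and keep the last l characters.
    padding = MAP[0] * l
    return (padding + result)[-l:]

print(i2text(0, 4))
print(i2text(1, 4))
print(i2text(12, 4))
print(i2text(36, 4))
print(i2text(400000, 4))
print(i2text(1600000, 4))
Results:
0000
0001
000c
0010
8kn4
yakg
Your URLs would then be of the form http://mydomain.com/myapp/short/8kn4.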
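The same base conversion can be sketched in Ruby, where Integer#to_s and String#to_i accept a radix of up to 36 (digits followed by lowercase letters). The short_code and id_from_code names are hypothetical, and raising when the id does not fit is a choice made for this sketch rather than part of the algorithm described above.

```ruby
# Encodes a non-negative sequence number as a fixed-length base-36 string.
def short_code(id, length)
  code = id.to_s(36)
  raise ArgumentError, "#{id} does not fit in #{length} characters" if code.length > length

  code.rjust(length, "0") # left-pad with the zero symbol
end

# Decodes a code back to the sequence number used for the database lookup.
def id_from_code(code)
  code.to_i(36)
end

puts short_code(400_000, 4)
puts id_from_code("8kn4")
```

Because the mapping is reversible, only the sequence number needs to be stored; the short code can be recomputed whenever the URL is rendered.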
