How can I write a routine in ESQL/C that initializes all numeric fields (SMALLINT, DECIMAL, etc.) in a table to 0 (zero) and all other fields to a space?
You asked this question on one of the IIUG (International Informix Users Group) mailing lists, and received answers there.
As I mentioned when I responded this morning, you need to think rather carefully about some of the types you don't mention. Zeroing a DATE sets it to 1899-12-31. Zeroing an INTERVAL makes sense (though you don't do it by setting all bytes zero; a similar comment applies to DECIMAL and MONEY and related types). Zeroing a DATETIME generates an invalid value. And BYTE, TEXT, BLOB, CLOB have separate sets of issues.
Setting character fields to a space is easy enough. Setting integer fields of various sorts to zero is also easy enough.
You can look at the code in either Art Kagel's utils2_ak package or in my SQLCMD package, both available from the IIUG Software Archive. They have code that covers many of the situations you're likely to encounter.
If they're not sufficient to help, then you need to show the code you're having problems with, explaining what is happening, what you want to happen, and (if appropriate) any messages you're getting that are preventing it from happening.
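Not ESQL/C, but to make those per-type decisions concrete, here is a small Python sketch that maps Informix type names to "blank" literals and builds an INSERT statement. The type lists and the choice of NULL for the awkward types (DATE, DATETIME, INTERVAL, BYTE, TEXT, and friends) are my assumptions for illustration, not something taken from either package.

# Per-type defaults for a "blank" record. DATE, DATETIME and INTERVAL deliberately
# get no zero-ish default here -- decide those explicitly (NULL is often the honest choice).
NUMERIC_TYPES = {'SMALLINT', 'INTEGER', 'INT8', 'BIGINT', 'SERIAL',
                 'DECIMAL', 'MONEY', 'FLOAT', 'SMALLFLOAT'}
CHARACTER_TYPES = {'CHAR', 'VARCHAR', 'NCHAR', 'NVARCHAR', 'LVARCHAR'}

def default_literal(type_name):
    t = type_name.upper()
    if t in NUMERIC_TYPES:
        return '0'
    if t in CHARACTER_TYPES:
        return "' '"
    return 'NULL'    # DATE, DATETIME, INTERVAL, BYTE, TEXT, BLOB, CLOB, ...

def blank_insert(table, columns):
    """columns: list of (name, type_name) pairs, e.g. read from syscolumns."""
    names = ', '.join(name for name, _ in columns)
    values = ', '.join(default_literal(t) for _, t in columns)
    return f'INSERT INTO {table} ({names}) VALUES ({values})'

print(blank_insert('customer', [('cust_no', 'SERIAL'), ('name', 'CHAR'), ('balance', 'DECIMAL')]))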
Related
I'm developing an application that parses COBOL programs. Some of these programs respect the traditional coding style (program text from column 8 to 72), and some are newer and don't follow this style.
In my application I need to determine the coding style in order to know if I should parse content after column 72.
I've been able to determine whether a program starts at column 1 or column 8, but programs that start at column 1 can also follow the rule of comments after column 72.
So I'm trying to find rules that will allow me to determine whether text after column 72 is a comment or valid code.
I've found some, but it's hard to tell whether they will work every time:
A period after column 72 marks the end of a sentence, but I fear that periods can appear in comments too.
Find the closing character of a statement after column 72: " ' ) }
Look at the characters in columns 71, 72 and 73; if there is no space, find the whole word and check whether it's a keyword or a variable. Problem: it can be a variable from a COPY or a replacement, etc.
I'd like to know what you think of these rules and whether you have any ideas to help me determine the coding style of a COBOL program.
I don't need an API or anything like that, just solid rules that I can rely on.
I think you need to know the COBOL compiler for each program. Its documentation should tell you what conventions/configurations/switches it uses to decide if the source code ends at column 72 or not.
So.... which compiler(s)?
And if you think the column 72 issue is a pain, wait till you get around to actually parsing the COBOL itself. If you are not well prepared to handle the lexical issues of the language, you are probably very badly prepared to handle the syntactic ones.
There is no absolutely reliable way to determine whether a COBOL program is in fixed or free format based only on the source code. Heck, it is sometimes difficult to identify the programming language itself based only on source code. Check out this classic polyglot - it is valid under 8 different language compilers. That said, you could try a few heuristics that might yield the correct answer more often than not.
Compiler directives embedded in source code
Watch for certain compiler directives that determine code format. Unfortunately, every compiler vendor uses their own flavour of directive. For example, Micro Focus COBOL uses the SOURCEFORMAT directive. This directive will appear near the top of the program, so a short pre-scan could be used to find it. On the other hand, OpenCOBOL uses >>SOURCE FORMAT IS FREE and >>SOURCE FORMAT IS FIXED to toggle between free and fixed format; different parts of the same program could be formatted differently!
The bottom line here is that you will have to support the conventions of multiple COBOL compilers.
Compiler switches
Source code format can also be specified using a compiler switch. In this case, there are no concrete clues to go on. However, you can be reasonably sure that the entire source program will be either fixed or free. All you can do here is guess. Unless the programmer is out to "mess with your head" (and some will), a program in free format will have the keywords IDENTIFICATION DIVISION or ID DIVISION starting before column 8. Every COBOL program will begin with these keywords, so you can use them as the anchor point for determining code format in the absence of embedded compiler directives.
Warning - this is far from foolproof, but it might be a good start.
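For illustration, here is a rough Python sketch of the two heuristics above - scan for a format directive first, then fall back to the column where IDENTIFICATION DIVISION / ID DIVISION starts. Only the OpenCOBOL and Micro Focus directive spellings mentioned above are covered, and the Micro Focus $SET form is my assumption, so treat this as a starting point rather than a complete detector.

import re

# Directive patterns for the two compilers mentioned above; other vendors
# use their own spellings, so extend this as needed.
DIRECTIVE_RE = re.compile(
    r'>>\s*SOURCE\s+FORMAT\s+(?:IS\s+)?(FREE|FIXED)'    # OpenCOBOL / COBOL 2002 style
    r'|\$\s*SET\s+SOURCEFORMAT\s*"?(FREE|FIXED)"?',     # Micro Focus style (assumed spelling)
    re.IGNORECASE)

ID_DIVISION_RE = re.compile(r'\b(IDENTIFICATION|ID)\s+DIVISION\b', re.IGNORECASE)

def guess_source_format(lines, prescan=100):
    """Return 'FREE', 'FIXED', or None if no confident guess can be made."""
    for line in lines[:prescan]:
        m = DIRECTIVE_RE.search(line)
        if m:
            return (m.group(1) or m.group(2)).upper()
    # Fall back to the anchor-point heuristic: where does ID DIVISION start?
    for line in lines:
        m = ID_DIVISION_RE.search(line)
        if m:
            return 'FREE' if m.start() < 7 else 'FIXED'   # column 8 is index 7
    return None

if __name__ == '__main__':
    import sys
    with open(sys.argv[1], errors='replace') as f:
        print(guess_source_format(f.read().splitlines()))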
There won't be an algorithm to do this with 100% certainty, because if comments can be anything, they can also be compilable COBOL code. So you could theoretically write a program that means one thing if the comments are ignored, and something else entirely if the comments are treated as part of the COBOL.
But that's extremely unlikely. What's most likely to happen is that if you try to compile the code under the wrong convention, it will simply fail. So the only accurate way to do this is to try compiling/parsing the program one way, and if you come to a line that can't make sense, switch to the other style. You could also support passing an argument to the compiler when the style is already known.
You can try using heuristics like what you've described, but that will never be totally accurate. The most they can give you is a probability that the code is one or the other style, which will increase as they examine more and more lines of code. They could be useful for helping you guess the style before you start compiling, or for figuring out when the problem is really just a typo in the code.
EDIT:
Regarding ideas for heuristics, it's hard to say. If there were a standard comment sigil like // or # in other languages, this would be a lot easier (actually, there is, but it sounds like your code doesn't follow this convention). The only thing I can think of would be to check whether every line (or maybe 99% of lines, and not counting empty lines or lines commented with *) has a period somewhere before position 72.
One thing you DON'T want to do is apply any heuristics to the part after position 72. That is, you don't want to be checking the comments to see if they're valid COBOL. You want to check what you know is COBOL first, and see if that works by itself. There are several reasons for this:
Comments written in English are likely to have periods and quotes in them, so your first and second bullet points are out.
Natural languages are WAY harder to parse than something like COBOL.
The comments could easily have COBOL in them (maybe someone commented out the previous version of the line).
An important rule for comments is that they should never affect what the program does. If changing the comments can change how the program is compiled, you violate that.
All that in mind, my opinion is that you shouldn't use heuristics at all. You should always try to compile the program under both conventions unless one is explicitly specified. There's a chance that code will compile successfully under both conventions, and then you'll have two different programs and no way to tell which one is correct.
If that happens, you need to compare the two results (perhaps with a hash or something) to see if they're the same program. If they're the same, great, but if not, you'll need to force the user to explicitly choose a convention.
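Here is a rough Python sketch of that "parse it both ways, then compare" approach; parse_cobol is a hypothetical callable standing in for whatever real parser you plug in.

import hashlib

def detect_by_parsing(source, parse_cobol):
    """parse_cobol(source, free_format=...) is your (hypothetical) parser entry point;
    it should return some parse result or raise SyntaxError on failure."""
    results = {}
    for free in (False, True):
        try:
            tree = parse_cobol(source, free_format=free)
        except SyntaxError:
            continue
        # Hash a canonical rendering of the result so the two runs can be compared.
        results[free] = hashlib.sha256(repr(tree).encode('utf-8')).hexdigest()
    if len(results) == 1:
        return 'FREE' if True in results else 'FIXED'
    if len(results) == 2 and results[True] == results[False]:
        return 'EITHER'   # both conventions yield the same program
    return None           # both failed, or they disagree: ask the user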
Most COBOL compilers will allow you to generate and analyze the output of the text-manipulation phase.
The text preprocessor output can be seen (using OpenCOBOL for the example)
cobc -E program.cob
The text manipulation processor deals with any COPY ... REPLACING compiler directives, as well as converting SOURCE FORMAT IS FIXED (with line continuations, string literal concatenations, comment line removal, among other things) into the actual free format that the compiler's lexical analyzer needs. A lot of the OpenCOBOL toolkits (Cross referencer and Animator, to name two) use source code AFTER the preprocessor pass. I don't think you'll lose any street cred if your parser program relies on post-processed source code files.
I've recently been receiving a lot of first-name-only entries in a form. Maybe I should have had two separate first and last name fields, but that always seemed to me a bit much. Still, I would like to try to get a full name, which basically can only be detected by requiring at least one space.
I came up with this, but I'm wondering if someone has a better and possibly simpler solution?
/([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,}) ([a-zA-ZàáâäãåèéêëìíîïòóôöõøùúûüÿýñçčšžÀÁÂÄÃÅÈÉÊËÌÍÎÏÒÓÔÖÕØÙÚÛÜŸÝÑßÇŒÆČŠŽ∂ð,.'-]{2,})/
This is basically this /([a-zA-Z,.'-]) ([a-zA-Z,.'-])/ plus unicode support.
I'd first make sure that you really do need people to give you a last name. Is that a genuine requirement? If not, I'd skip it because it adds unnecessary complication and barriers to entry. If it really IS a requirement, it probably makes sense to have separate first and last name fields in your UI so that it's explicit.
The fact that you didn't do that to begin with suggests that you might not really need the last name as much as you think you do.
To answer your original question, this expression might give you what you're looking for without the guesswork:
/[\w]+([\s]+[\w]+){1}+/
It checks that the string contains at least 2 words separated by whitespace. Like Tim Pietzcker pointed out, validating the words themselves is prone to error.
In Ruby 1.9, you have access to Unicode properties (\p{L} is a Unicode letter). But trying to validate a name in any way (regex or not) is prone to failure because names are not what you think they are.
Your theory that "if there's a space, there must be a last name there" is incorrect, too - think of first and middle names...
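If all you want is the "at least two whitespace-separated words" check with Unicode letters accepted, here is a plain-Python sketch that sidesteps regex character classes entirely (note that Python's standard re module doesn't support \p{L}, which is one reason to avoid a regex here; this is an illustration, not a claim that names can be validated).

def looks_like_full_name(name):
    """True if the string has at least two whitespace-separated tokens,
    each containing at least one Unicode letter (accents, eszett, etc. included)."""
    tokens = name.split()
    return len(tokens) >= 2 and all(any(ch.isalpha() for ch in t) for t in tokens)

# Examples:
print(looks_like_full_name("Amélie Poulain"))          # True
print(looks_like_full_name("Prince"))                  # False
print(looks_like_full_name("Björk Guðmundsdóttir"))    # True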
I have an application for storing many strings in a TStringList. The strings will be largely similar to one another and it occurs to me that one could compress them on the fly - i.e. store a given string in terms of a mixture of unique text fragments plus references to previously stored fragments. StringLists such as lists of fully-qualified path and filenames should be able to be compressed greatly.
Does anyone know of a TStringList descendant that implements this - i.e. provides read and write access to the uncompressed strings but stores them internally compressed, so that a TStringList.SaveToFile produces a compressed file?
While you could implement this by uncompressing the entire stringlist before each access and re-compressing it afterwards, it would be unnecessarily slow. I'm after something that is efficient for incremental operations and random "seeks" and reads.
TIA
Ross
I don't think there's any freely available implementation around for this (not that I know of anyway, although I've written at least 3 similar constructs in commercial code), so you'd have to roll your own.
The remark Marcelo made about adding items in order is very relevant, as I suppose you'll probably want to compress the data at addition time - having quick access to entries already similar to the one being added gives much better performance than having to look up a 'best fit' entry (needed for similarity compression) over the entire set.
Another thing you might want to read up on is 'ropes' - a conceptually different type from strings, which I already suggested to Marco Cantu a while back. At the cost of a next-pointer per 'twine' (for lack of a better word), you can concatenate parts of a string without keeping any duplicate data around. The main problem is how to retrieve the parts that can be combined into a new 'rope' representing your original string. Once that problem is solved, you can reconstruct the data as a string at any time, while still having compact storage.
If you don't want to go the 'rope' route, you could also try something called 'prefix reduction', which is a simple form of compression - just start out each string with an index of a previous string and the number of characters that should be treated as a prefix for the new string. Be aware that you should not recurse this too far back, or access speed will suffer greatly. In one simple implementation, I did a mod 16 on the index to establish the entry at which prefix reduction started, which gave me on average about 40% memory savings (this number is completely data-dependent, of course).
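Not Delphi, but here is a compact Python sketch of that prefix-reduction layout, keeping the mod-16 restart described above; everything else (the tuple layout, chaining to the previous entry) is just one possible arrangement.

class PrefixReducedList:
    """Append-only string list storing each entry as (base_index, prefix_len, suffix).
    Every 16th entry is stored verbatim so lookups never chase a long chain."""

    def __init__(self, restart_every=16):
        self._entries = []          # (base_index, prefix_len, suffix); base_index -1 = verbatim
        self._restart = restart_every

    def append(self, s):
        i = len(self._entries)
        if i % self._restart == 0:
            self._entries.append((-1, 0, s))          # restart point: stored in full
            return
        prev = self.get(i - 1)
        n = 0
        while n < len(prev) and n < len(s) and prev[n] == s[n]:
            n += 1
        self._entries.append((i - 1, n, s[n:]))       # shared prefix length + new suffix

    def get(self, i):
        base, plen, suffix = self._entries[i]
        return suffix if base < 0 else self.get(base)[:plen] + suffix

    def __len__(self):
        return len(self._entries)

Sorted input (like lists of fully-qualified paths) gives the best savings, which ties back to the earlier point about adding items in order.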
You could try to wrap a Delphi or COM API around Judy arrays. The JudySL type would do the trick, and has a fairly simple interface.
EDIT: I assume you are storing unique strings and want to (or are happy to) store them in lexicographical order. If these constraints aren't acceptable, then Judy arrays are not for you. Mind you, any compression system will suffer if you don't sort your strings.
I suppose you expect general flexibility from the list (including delete operations); in that case I don't know of any out-of-the-box solution, but I'd suggest one of two approaches:
You split your strings into words and keep a separate, growing dictionary to reference the words, saving a list of indexes internally.
You implement something based on the zlib stream available in Delphi, but operating on blocks that, for example, can contain 10-100 strings each. In this case you still have to decompress/recompress a complete block on access, but the "price" you pay is lower (see the sketch below).
I don't think you really want to compress TStrings items in memory, because it's terribly inefficient. I suggest you look at the TStream implementation in the ZLib unit. Just wrap a regular stream in TDecompressionStream on load and TCompressionStream on save (you can even emit a gzip header there).
Hint: you will want to override LoadFromStream/SaveToStream instead of LoadFromFile/SaveToFile
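The same "keep it plain in memory, compress only at the file boundary" idea, sketched in Python with the gzip module rather than Delphi's ZLib unit (again assuming the strings contain no embedded newlines).

import gzip

def save_strings(path, strings):
    # Plain text in memory; gzip (zlib with a gzip header) only on disk.
    with gzip.open(path, 'wt', encoding='utf-8') as f:
        f.write('\n'.join(strings))

def load_strings(path):
    with gzip.open(path, 'rt', encoding='utf-8') as f:
        data = f.read()
    return data.split('\n') if data else []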
I'm extensively using hash map data structures in my program. I'm using a hash map implementation by Barry Kelly posted on the CodeGear forums. That implementation internally uses the RTL's CompareText function. Profiling made me realize that A LOT of time is spent in the SysUtils CompareText function.
I had a look at the Fastcode site and found some faster implementations of CompareText. Unfortunately, they don't seem to work with D2009 and its Unicode strings.
Now for the question: Is there a similar, faster version that supports D2009 strings? The CompareText function seems to be called a lot when using hash maps (at least in the implementation I'm currently using), so even small performance improvements could really make a difference. Or should the implementations presented there also work for Unicode strings?
Many of the FastCode functions will probably compile and appear to work just fine in Delphi 2009, but they won't be right for all input. The ones that are implemented in assembler will fail because they assume characters are just one byte each. The ones implemented in Delphi will fare a little better, but they'll still return incorrect results sometimes because the old CompareText's notion of "case-insensitive" is based on ASCII whereas the new one should be based on Unicode. The rules for which characters are considered the same save for case are much different for Unicode from how they are for ASCII.
Andreas says in a comment below that Unicode CompareText still uses the ASCII case-comparison rules, so a number of the FastCode functions should work fine. Just look them over before using them to make sure they're not making any character-size assumptions. I seem to recall that some FastCode functions were incorporated into the Delphi RTL already. I have no idea whether CompareText was one of them.
If you're calling CompareText a lot in a hash table, then that suggests your hash table isn't doing a very good job. CompareText should only get called when the hash of the thing you're searching for designated a non-empty bucket in the hash table. From there, a hash table will often use a linear search to find the right item in the bucket, and it will call CompareText for every item during that search. I don't know whether that's how the one you're using works.
You might solve this by using a different hash function that distributes its results more evenly over the available buckets. If your buckets are already evenly filled, then you may need more buckets (and then make sure the hash function still distributes evenly over that number as well).
If the hash-map class you're using is based on TBucketList, then there is room for improvement in the bucket storage. That class doesn't calculate a hash on the entire input. It uses the input only to determine the bucket to use. If the class would also keep track of the full hash computed for a string, then comparisons during the linear search could go much faster. Just compare the hashes, and only compare the strings when the hashes match completely. (For a 256-bucket bucket-list, the largest supported size, only one byte of the input determines the bucket, and the rest of the bytes are ignored.) I've written about TBucketList here before.
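To make the "keep the full hash with each entry" suggestion concrete, here is a small Python sketch of a bucketed map that only falls back to a case-insensitive string comparison when the stored hashes match. The case-folding and hash functions are just stand-ins for whatever the real Delphi class uses.

class HashedBucketMap:
    """Toy hash map storing (full_hash, key, value) per entry so the linear search
    in a bucket compares integers first and strings only on a hash match."""

    def __init__(self, bucket_count=256):
        self._buckets = [[] for _ in range(bucket_count)]

    @staticmethod
    def _fold(key):
        return key.casefold()             # stand-in for the map's case-insensitive rule

    def _hash(self, key):
        return hash(self._fold(key))      # stand-in for the map's real hash function

    def put(self, key, value):
        h = self._hash(key)
        bucket = self._buckets[h % len(self._buckets)]
        for i, (eh, ekey, _) in enumerate(bucket):
            if eh == h and self._fold(ekey) == self._fold(key):
                bucket[i] = (h, key, value)
                return
        bucket.append((h, key, value))

    def get(self, key, default=None):
        h = self._hash(key)
        for eh, ekey, value in self._buckets[h % len(self._buckets)]:
            if eh == h and self._fold(ekey) == self._fold(key):   # cheap integer check first
                return value
        return default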
There are the standard A-Z and a-z characters, but there are also hyphens, em dashes, quotes, etc.
Plus, there are all of the international characters, like umlauts, etc.
So, for an English-based system, what's the complete set? What about sets for other languages? What about UTF8, UTF16, etc?
Bonus question: How many name fields are needed, and what are their maximum lengths?
EDIT: There are definitely two different types of characters involved in people's names, those that are there as part of the context, and those that are there for structural reasons. I don't want to limit or interfere with the context characters, but I do need to deal with the structural ones.
For example, I had a name come in that was separated by an em dash, but it was hard to distinguish that from the minus character. To make the system easier to search, I want to take all five different types of dashes and map them onto one unique character (minus), so that the searcher doesn't need to know which symbol was originally entered.
The problem exists for dashes, and probably quotes as well, but how many other symbols are affected?
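For the dash folding described above, a small normalization pass might look like this (Python for illustration; the set of dash code points here is an assumption, so extend it as you meet more).

# Map common dash-like code points onto a plain hyphen-minus for searching.
DASH_MAP = str.maketrans({
    '\u2010': '-',   # hyphen
    '\u2011': '-',   # non-breaking hyphen
    '\u2012': '-',   # figure dash
    '\u2013': '-',   # en dash
    '\u2014': '-',   # em dash
    '\u2212': '-',   # minus sign
})

def normalize_for_search(name):
    """Fold dash variants to '-' so searches don't depend on which one was typed."""
    return name.translate(DASH_MAP)

print(normalize_for_search('Jean\u2014Luc'))   # Jean-Luc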
There's a good article by the W3C called Personal names around the world that explains the problems (and possible solutions) pretty well (it was originally a two-part blog post by Richard Ishida: part 1 and part 2).
Personally I'd say: support every printable Unicode character and, to be safe, provide just a single "name" field that contains the full, formatted name. This way you can store pretty much every form of name. You might need more structured storage, but then don't expect to be able to store every single combination in a structured form, as there are simply too many different ones.
Whitelisting characters that could appear in a person's name is the wrong way to go, if you ask me. Sure, [A-Za-z] is a fair starting point, but, as you said, you get problems with "European" names. So you map all the umlauts, circumflexes and those. What about Chinese names? Japanese? Indian? Hebrew? You're entering a battle against wind turbines.
If you absolutely must check the validity of someone's name, I'd suggest doing a modest blacklist of certain characters. Braces, mathematical characters, some punctuation and such might be safe to ignore. But I'd be cautious, if I were you.
It might be best to just accept whatever comes in. UTF-16 should be today's overkill character set, that should be adequate for some years to come.
Edit: As for your question about name length and the number of names: if you really want people to write their real and complete names, I guess the only foolproof answer to both of those questions would be "infinite". I can't whip out any real examples for human beings, but surely there are analogues among humans to the full native name of the city of Bangkok.
I don't think there's a definitive answer. After all, some people have names that can't even be expressed in UTF-16...
There are some odd people out there, who will give their kids the craziest of names, including putting in weird punctuation, accents that don't exist in their own language, etc.
However, you can place arbitrary restrictions on your database. If you want to you can insist on 7 bit ASCII names. It's slightly rude to users, but they'll live with it. It certainly makes searching easier.
My colleague's daughter is named Amélie. But even some (not all!) official British government web sites ("Please enter the name exactly as shown on the birth certificate") won't accept the Unicode, so he has to use 'Amelie' instead.
Any character that can be represented by any multiple of eight bits (greater than zero) is a possible character for a person's name. Lengths of both names and encodings are arbitrary, so no upper bound should be considered.
Just make sure you sanitize your database inputs so little Bobby Drop-tables doesn't get ya.
On the issue of name fields, the WRONG answer is first name, middle initial, last name, etc. for many reasons.
Many people are known by their middle name, and formally use a first initial, middle name, last name format.
In some cultures, the surname is the first name, and the given name is the last name.
Multiple first and/or middle given names are getting more common. As @Dour High Arch points out, the other extreme is people with only one word in their name.
In an object-oriented database, you would store a Name object with methods to return a directory-style or signature-style name; and the backing store would contain whatever data was necessary to support those methods.
I haven't yet seen a relational database model that improves on the model of two variable-length strings for directory-style and signature-style names.
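A minimal sketch of that Name-object idea (Python for illustration; the particular fields and the two formatting methods are assumptions about what "directory-style" and "signature-style" mean here).

class Name:
    """Stores whatever parts were captured and formats them on demand."""

    def __init__(self, given='', middle='', family='', suffix=''):
        self.given, self.middle, self.family, self.suffix = given, middle, family, suffix

    def signature_style(self):
        # e.g. "John Q. Public Jr."
        return ' '.join(p for p in (self.given, self.middle, self.family, self.suffix) if p)

    def directory_style(self):
        # e.g. "Public, John Q. Jr." -- falls back to signature style if no family name
        if not self.family:
            return self.signature_style()
        rest = ' '.join(p for p in (self.given, self.middle, self.suffix) if p)
        return f'{self.family}, {rest}' if rest else self.family

print(Name('John', 'Q.', 'Public', 'Jr.').directory_style())   # Public, John Q. Jr.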
I'm making software for driving schools in the USA, so to me what matters most is what the state DMVs accept as a proper name on a driver's license. In my case, it would cause problems to allow names beyond what the DMV allows, even if such names were legal, because the same name must later be used on a driver's license.
From StackOverflow, I still hadn't been able to confirm the answer I needed. And I happen to know that in my state (California) they're using AS/400s with software probably written in COBOL, and to the best of my knowledge, those only support an 8-bit character set. (Is it EBCDIC?) Anyway... Ugh.
So, I called the California DMV... Sure enough, their system allows A-Z and spaces and absolutely nothing else. Not even hyphens are allowed -- hyphens are replaced with spaces. In fact, apparently just to be difficult, they only use capitals. And names such as "O'Malley" must be entered as OMALLEY.
Leave it to government. I must say I'm thrilled not to be a developer working for DMV. (Although I could really use that kind of salary.)
It really depends on what the app is supposed to be used for.
Sure, in theory it's great if you allow every script on god's green earth to be used, but if the DB is also used by support staff, are they going to be able to handle names in Japanese, Hebrew and Thai script? Can your printer, if it's used to print postage labels?
You might add an extra "Latin transcription" field, but IMO it's really OK to restrict it to ISO-8859-1 characters - people who don't use Latin characters are by now so used to having to use a transcription that they don't mind it anymore, unless they're hardcore nationalists.
UTF-8 should be good enough. As far as name fields go, you'll want at minimum a first name and a last name.
Depending on the complexity of your name structure, I could see:
First Name
Middle Initial/Middle Name
Last Name
Suffix (Jr., Sr., II, III, IV, etc.)
Prefix (Mr., Mrs., Ms., etc.)
What do you do when you have "The Artist Formerly Known as Prince"? The symbol he used is not a character in the Unicode set (AFAIK).
It's some levity, but at the same time, names are a rather broad concept that doesn't lend itself well to a structured format. In this case, something free-form might be most appropriate.