Disassemble a flatfile with BizTalk when a record contains nulls [duplicate] - character-encoding

This question already has an answer here:
BizTalk Schema development - hexadecimal value 0x19, is an invalid character
(1 answer)
Closed 3 years ago.
An external partner will be sending us comma-separated files containing purchase orders, which we need to process in our BizTalk environment. I will be using BizTalk's flatfile disassembler to transform the csv data into workable xml.
However, many of the records in our partner's files have null values for certain 'columns' (e.g. not all address fields have a value). Upon disassembly by BizTalk using the flatfile xsd (created with the wizard in Visual Studio), the resulting xml contains elements like this:
<ShipToAddress2></ShipToAddress2>
Our subsequent mappings then fail with
System.Xml.XmlException: '.', hexadecimal value 0x00, is an invalid character.
How can I eliminate this error? Is there a property in the flatfile xsd, or a setting in the disassembler, that controls how null values in the data are handled?

If the relevant fields don't already have a pad character set, you can use that setting to strip the null characters during disassembly.
For every field that requires it, first set the Pad Character Type to Hexadecimal, then set the Pad Character to 0x00.
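If you want to confirm that the failing payload really does contain NUL (0x00) or other XML-illegal characters before touching the schema, a quick check outside BizTalk can help. A minimal Python sketch (the input file name is hypothetical); this is only a diagnostic, not part of the BizTalk fix itself:

import re

# Control characters that are illegal in XML 1.0, including NUL (0x00).
ILLEGAL_XML_CHARS = re.compile(r"[\x00-\x08\x0B\x0C\x0E-\x1F]")

def find_illegal_chars(path):
    """Report any XML-illegal control characters in the raw flat file."""
    with open(path, "rb") as f:
        data = f.read().decode("latin-1")  # byte-for-byte decode, purely for inspection
    for match in ILLEGAL_XML_CHARS.finditer(data):
        print(f"offset {match.start()}: 0x{ord(match.group()):02X}")

find_illegal_chars("purchase_orders.csv")  # hypothetical sample file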

Related

Problem reading variables containing mix of numbers and strings

I am reading an Excel file (see syntax below) where some of the fields are text mixed with numbers. The problem is that SPSS reads some of these fields as numeric instead of string and then the text is deleted.
I assume this happens when most of the first rows are empty or contain a numeric value, so SPSS defines the variable as numeric.
How can this be avoided?
GET DATA
/TYPE=XLSX
/FILE='M:\MyData.xlsx'
/SHEET=name 'Sheet1'
/CELLRANGE=FULL
/READNAMES=ON
/DATATYPEMIN PERCENTAGE=95.0
/HIDDEN IGNORE=YES.
When you use the GET DATA command, the subcommand /DATATYPEMIN PERCENTAGE=95.0 tells SPSS that a field may still be given the selected format even if up to 5% of its values do not conform to it. So, to avoid cases where a field containing only a few text values is read as numeric, change the subcommand to:
/DATATYPEMIN PERCENTAGE=100
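As a cross-check outside SPSS, you can also read the file with every column forced to text and inspect what the mixed columns really contain. A rough sketch using Python/pandas (file and sheet names are hypothetical); this is only a way to verify the data, not an SPSS fix:

import pandas as pd

# Force every column to string so no value is coerced to numeric and lost.
df = pd.read_excel("MyData.xlsx", sheet_name="Sheet1", dtype=str)

# Mixed columns now keep both their numeric-looking and their text values.
print(df.dtypes)
print(df.head())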

tFuzzyMatch apparently not working on Arabic text strings

I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns; one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names, only exact matches are found regardless of the underlying match algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can operate on the Unicode or even UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the collation.
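The underlying issue is usually that one of the two inputs is no longer real Unicode text by the time it reaches the matcher. A small Python sketch (the sample names are made up) showing that Levenshtein distance itself handles Arabic fine, while mojibake makes everything look like a non-match:

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

clean_a = "محمد احمد"   # properly decoded Arabic
clean_b = "محمد أحمد"   # differs by one character
print(levenshtein(clean_a, clean_b))    # small distance, fuzzy matching works

# If one side was read with the wrong encoding, the strings share almost nothing.
garbled_b = clean_b.encode("utf-8").decode("latin-1")
print(levenshtein(clean_a, garbled_b))  # large distance, only exact matches would survive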

Unicode support in .NET MVC

The problem I have is that I wish to support Unicode within my project (an MVC project),
whereby a user can post a comment using characters such as ペ without this becoming ????.
Any information that can be shared on this subject would be greatly appreciated.
You need to understand what a character set and a character encoding basically are.
A character set defines a set of textual and graphic symbols, each of which is mapped to a set of non-negative integers. For example, when the database stores the letter A, it actually stores a numeric code that software interprets as the letter "A"; this numeric code is called a code point or encoded value.
Character encoding is the process of assigning a code point to a character; it defines a rule for representing and storing a character in a character set.
You also need to know what a collation is: the bit patterns that represent each character, plus the rules applied to characters when they are stored and compared, in case you are storing the same data in your database.
In a nutshell, you need to set charset="UTF-8" on all your web pages, and do the same on your database.
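To make the code point vs. encoding distinction concrete, here is a minimal Python illustration using the character from the question; the same idea applies regardless of the web framework:

ch = "ペ"  # KATAKANA LETTER PE, the character from the question

# The code point: the number the character set assigns to the symbol.
print(hex(ord(ch)))                           # 0x30da

# The encoding: how that code point is stored as bytes on the wire or on disk.
print(ch.encode("utf-8"))                     # b'\xe3\x83\x9a' (3 bytes in UTF-8)

# Encoding with a charset that has no mapping for the character is how ???? happens.
print(ch.encode("ascii", errors="replace"))   # b'?'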

Which NLS_LENGTH_SEMANTICS for WE8MSWIN1252 Character Set

We have a database where the character set is set to WE8MSWIN1252, which I understand is a single-byte character set.
We created a schema and its tables by running a script with the following:
ALTER SYSTEM SET NLS_LENGTH_SEMANTICS=CHAR
Could we possibly lose data since we are using VARCHAR2 columns with character semantics while the underlying character set is single byte?
If you are using a single-byte character set like Windows-1252, it is irrelevant whether you are using character or byte semantics. Each character occupies exactly one byte so it doesn't matter whether a column is declared VARCHAR2(10 CHAR) or VARCHAR2(10 BYTE). In either case, up to 10 bytes of storage for up to 10 characters will be allocated.
Since you gain no benefit from changing the NLS_LENGTH_SEMANTICS setting, you ought to keep the setting at the default (BYTE) since that is less likely to cause issues with other scripts that you might need to run (such as those from Oracle).
Excellent question. Multi-byte characters will take up the number of bytes required, which could use more storage than you expect. If you store a 4-byte character in a varchar2(4) column, you have used all 4 bytes. If you store a 4-byte character in a varchar2(4 char) column, you have only used 1 character. Many foreign languages and special characters use 2-byte character sets, so it's best to 'know your data' and make your database column definitions accordingly. Oracle does NOT recommend changing NLS_LENGTH_SEMANTICS to CHAR because it will affect every new column defined as CHAR or VARCHAR2, possibly including your catalog tables when you do an in-place upgrade. You can see why this is probably not a good idea. Other Oracle toolsets and interfaces may present issues as well.
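To see the byte-versus-character arithmetic both answers rely on, here is a short Python sketch (the sample strings are just examples):

samples = ["Smith", "Müller", "東京"]  # ASCII, Latin-1 accented, CJK

for s in samples:
    chars = len(s)
    bytes_1252 = len(s.encode("cp1252", errors="replace"))  # single-byte Windows-1252
    bytes_utf8 = len(s.encode("utf-8"))                      # multi-byte UTF-8 (e.g. AL32UTF8)
    print(f"{s!r}: {chars} chars, {bytes_1252} bytes in cp1252, {bytes_utf8} bytes in UTF-8")

# In WE8MSWIN1252 every character is one byte, so VARCHAR2(10 BYTE) and VARCHAR2(10 CHAR)
# hold the same data. In a multi-byte character set, a BYTE-sized column can overflow
# even though the character count still fits.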

How many characters long can numeric EDIFACT data elements be?

In EDIFACT there are numeric data elements, specified e.g. as format n..5 -- we want to store those fields in a database table (with alphanumeric fields, so we can check them). How long must the db-fields be, so we can for sure store every possible valid value? I know it's at least two additional chars (for decimal point (or comma or whatever) and possibly a leading minus sign).
We are building our tables based on the UN/EDIFACT standard we use in our messages, not the specific implementation guide involved, so we want to be able to store everything matching that standard. But the documentation on the numeric data elements isn't really straightforward (or at least I could not find that part).
Thanks for any help
I finally found the information on the UNECE web site, in the documentation on the UN/EDIFACT rules, Part 4 (UN/EDIFACT rules), Chapter 2.2 (Syntax Rules). They don't say it directly, but when you put all the parts together, you get it. See TOC entry 10: REPRESENTATION OF NUMERIC DATA ELEMENT VALUES.
Here's what it basically says:
10.1: Decimal Mark
The decimal mark must be transmitted (if needed) as specified in UNA (comma or point, but always one character). It shall not be counted as a character of the value when computing the maximum field length of a data element.
10.2: Triad Separator
Triad separators shall not be used in interchange.
10.3: Sign
[...] If a value is to be indicated to be negative, it shall in transmission be immediately preceded by a minus sign e.g. -112. The minus sign shall not be counted as a character of the value when computing the maximum field length of a data element. However, allowance has to be made for the character in transmission and reception.
To put it together:
Other than the digits themselves, there are only two (optional) characters allowed in a numeric field: the decimal separator and a minus sign (no blanks are permitted between any of the characters). These two extra characters are not counted against the maximum length of the value in the field.
So the maximum number of characters in a numeric field is the maximum length of the numeric field plus 2. If you want your database to be able to store every syntactically correct value transmitted in a field specified as n..17, your column would have to be 19 characters long (something like varchar(19)). Any EDIFACT message that has a value longer than 19 characters in a field specified as n..17 does not need to be stored in the DB for semantic checking, because it is already syntactically wrong and can be rejected.
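If you would rather generate the column sizes than work them out by hand, the rule above is easy to encode. A small Python sketch (the n..N / nN format notation is from the EDIFACT documentation; the helper name is mine):

import re

def varchar_length(edifact_format: str) -> int:
    """Column width needed for a numeric EDIFACT format such as 'n..17' or 'n3'.

    Maximum number of digits, plus one for an optional decimal mark and one for an
    optional leading minus sign; neither counts against the EDIFACT field length,
    but both take up a character in the transmitted value.
    """
    m = re.fullmatch(r"n(\.\.)?(\d+)", edifact_format)
    if not m:
        raise ValueError(f"not a numeric EDIFACT format: {edifact_format}")
    return int(m.group(2)) + 2

print(varchar_length("n..17"))  # 19 -> e.g. varchar(19)
print(varchar_length("n..5"))   # 7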
I used EDI Notepad from Liaison to solve a similar challenge. https://liaison.com/products/integrate/edi/edi-notepad
I recommend anyone looking at EDI to at least get their free (express) version of EDI Notepad.
The "high end" version (EDI Notepad Productivity Suite) of their product comes with a "Dictionary Viewer" tool that you can export the min / max lengths of the elements, as well as type. You can export the document to HTML from the Viewer tool. It would also handle ANSI X12 too.
