I have created a job in Talend Open Studio for Data Integration v5.5.1.
I am trying to find matches between two customer name columns: one is a lookup and the other contains dirty data.
The job runs as expected when the customer names are in English. However, for Arabic names only exact matches are found, regardless of the matching algorithm I used (Levenshtein, Metaphone, Double Metaphone), even with loose bounds for the Levenshtein algorithm (min 1, max 50).
I suspect this has to do with character encoding. How should I proceed? Is there any way I can work with the Unicode or UTF-8 interpretation in Talend?
I am using Excel data sources through tFileInputExcel.
I got it resolved by moving the data to MySQL with a UTF-8 collation. Somehow the Excel input wasn't preserving the encoding.
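For reference, a rough sketch of that kind of move in Python with pandas and SQLAlchemy (the package choice, connection string, file name, and table name are illustrative assumptions, not my exact setup):

# Sketch: push the Excel data into a MySQL table whose character set is
# utf8mb4, so Arabic text survives the round trip into Talend.
import pandas as pd
from sqlalchemy import create_engine

# charset=utf8mb4 makes the connection itself speak full Unicode
engine = create_engine(
    "mysql+pymysql://user:password@localhost/customers?charset=utf8mb4"
)

df = pd.read_excel("dirty_customer_names.xlsx")  # placeholder file name

# Talend can then read this table with a tMysqlInput component
df.to_sql("customer_names_dirty", engine, if_exists="replace", index=False)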
Related
I am trying to extract data from this Japanese PDF using tabula-py (and tabula-java), but the output is gibberish. In both tabula-py and tabula-java, the output isn't human readable (definitely not Japanese characters), and there are no error/warning messages. It does seem that the content of the PDF is processed, though.
When using the standalone Tabula tool, the characters are encoded properly.
I searched online and in the tabula-py and tabula-java documentation; below are the suggestions I could find, but they don't change the output:
Setting -Dfile.encoding=utf8 in the Java call for tabula-py or tabula-java (a sketch of how this can be passed through tabula-py follows this list)
Setting chcp 65001 (in Windows command prompt)
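For context, this is roughly how that flag can be passed through tabula-py (the PDF file name is a placeholder, and lattice=True just reflects that the page is a ruled table):

# Sketch: pass the JVM encoding flag via tabula-py's java_options
import tabula

tables = tabula.read_pdf(
    "hoikuen_list.pdf",                      # placeholder for the Japanese PDF
    pages="all",
    lattice=True,                            # the page is a ruled table
    java_options=["-Dfile.encoding=UTF8"],   # the suggested JVM flag
)
print(tables[0].head())                      # still prints gibberish on my console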
I understand Tabula and tabula-java (and tabula-py) use the same library, but is there something different between the two that would explain the difference in encoding output?
Background info
There is nothing unusual in this PDF compared to any other.
The text, as in any PDF, is written in the author's arbitrary order; for example, the first line of the PDF body (港区内認可保育園等一覧) is the 1262nd block of text, added long after the table was started. To hear the written order we can use Read Aloud to verify character and language recognition, but unless the PDF was correctly tagged it will also jump from text block to text block.
So internally the text is rarely tabular; the first eight lines are:
1 認可保育園
0歳 1歳 2歳3歳4歳5歳 計
短時間 標準時間
001010 区立
3か月
3455-
4669
芝5-18-1-101
Thus you need text extractors that work in a grid-like manner or convert the text layout into row-by-row output.
This is where all extractors will be confounded as to how to output such a jumbled, dense layout, and generally all of them will struggle with this page.
Hence it's best to use a good generic solution. It will still need data cleaning, but at least you will have something to work on.
If you only need a zone from the page it is best to set the boundary of interest to avoid extraneous parsing.
Your "standalone Tabula tool" output is very good, but it could possibly be better if you use pdftotext -layout and adjust some of its options to produce a more regular order.
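If you want to try that route programmatically, a minimal sketch calling Poppler's pdftotext from Python (it assumes pdftotext is on the PATH; file names are placeholders):

# Sketch: layout-preserving extraction with an explicit UTF-8 output encoding
import subprocess

subprocess.run(
    ["pdftotext", "-layout", "-enc", "UTF-8", "hoikuen_list.pdf", "hoikuen_list.txt"],
    check=True,
)

# Read the result back as UTF-8 rather than relying on the console code page
with open("hoikuen_list.txt", encoding="utf-8") as f:
    print(f.read()[:500])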
Your Question
the difference in encoding output?
The Answer
The output from a PDF is not its internal coding. The desired text output is UTF-8, but a PDF does not store text as UTF-8 or Unicode; it simply uses numbers from a font character map. If the map is poor, everything will be gibberish. In this case, however, the map is good, so where does the gibberish arise? It arises because the output stage is not using UTF-8, and console output is rarely Unicode.
You correctly show that the console needs to be set to Unicode mode; then the output should match (except for the density problem).
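In other words, the fix is on the output side. A small illustrative sketch (assuming Python 3.7+; the sample string is taken from the PDF) of forcing UTF-8 output instead of relying on the console code page:

# Sketch: emit UTF-8 regardless of the Windows console's default code page
import sys

# Python 3.7+: switch stdout to UTF-8 before printing extracted text
sys.stdout.reconfigure(encoding="utf-8")
print("港区内認可保育園等一覧")

# Or bypass the console entirely and write to a UTF-8 file
with open("extracted.txt", "w", encoding="utf-8") as f:
    f.write("港区内認可保育園等一覧\n")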
The density issue would be easier to handle if the PDF were preprocessed into a flowing format such as HTML, or by using a different language.
When I export my data from a .dta file to .sps, my string variables get cut off and the data contains characters that look like a UTF-8 problem. I think the problem might be that some string variables have a width over 261; at least they are cut at that point.
Does SPSS have a character limit, and if so, how can I increase the number?
"The values of string variables can contain numbers, letters, and special characters and can be up to 32,767 bytes. "
You seem to be trying to export STATA data (.dta) to SPSS Statistics (.sav). What mechanism are you using in STATA to do this? Does STATA have a limitation on the width of string fields?
As horace_vr has already pointed out, an SPSS Statistics command file has the *.sps extension. Are you truly trying to have STATA write SPSS Statistics Command Syntax for you and save as *.sps? Perhaps you meant you were exporting from *.dta (in STATA) to *.sav (in SPSS Statistics).
Note also that if you have created an SPSS Statistics data file (*.sav) in code page mode, but then open it while SPSS Statistics is in Unicode mode, your string widths will triple in size. This is an artifact of the conversion of the various code pages to Unicode.
"When code page data files are read in Unicode mode, the defined width of all string variables is tripled. You can use ALTER TYPE to automatically adjust the width of all string variables."
I hope this helps
-ddwyer
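P.S. If the STATA export itself keeps misbehaving, one possible workaround is to convert the file outside both packages. This sketch uses the third-party pyreadstat package; the package choice and file names are my own assumption, not part of your current workflow:

# Sketch: convert a Stata .dta file straight to an SPSS .sav file,
# keeping the data in Unicode the whole way.
import pyreadstat

# read_dta returns the data plus variable metadata (labels, formats, etc.)
df, meta = pyreadstat.read_dta("mydata.dta")

# write_sav produces a .sav file that Unicode-mode SPSS Statistics opens directly
pyreadstat.write_sav(df, "mydata.sav", column_labels=meta.column_labels)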
I am facing an issue where I have multiple files with different charsets; say one file has a Chinese charset and the other a French charset. How can I load them into a single Hive table? I searched online and found this:
ALTER TABLE mytable SET SERDEPROPERTIES ('serialization.encoding'='SJIS');
With this I can handle the charset for one of the files, either the Chinese one or the French one. Is there a way to handle both charsets at once?
[UPDATE]
Okay, I am using RegexSerDe for a fixed-width file, and the encoding scheme being used is ISO 8859-1. It seems RegexSerDe is not taking this encoding scheme into account and splits the characters assuming the default UTF-8 encoding. Is there a way to take the encoding scheme into account with RegexSerDe?
I am not sure if this is possible (I think it isn't, based on https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/AbstractEncodingAwareSerDe.java). A workaround could be to create two tables with different encodings and create a view on top of them.
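A rough sketch of that workaround driven from Python with PyHive (host, port, table and view names, and the two encodings are placeholders; both tables need the same schema for the view to make sense):

# Sketch: one staging table per file encoding, unified behind a view
from pyhive import hive

conn = hive.connect(host="localhost", port=10000)
cur = conn.cursor()

# One table per source encoding (names and charsets are placeholders)
cur.execute("ALTER TABLE mytable_cn SET SERDEPROPERTIES ('serialization.encoding'='GBK')")
cur.execute("ALTER TABLE mytable_fr SET SERDEPROPERTIES ('serialization.encoding'='ISO-8859-1')")

# A view that presents both tables as one
cur.execute(
    "CREATE VIEW IF NOT EXISTS mytable_all AS "
    "SELECT * FROM mytable_cn UNION ALL SELECT * FROM mytable_fr"
)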
I am facing a weird problem.
I have extracted data from an Excel file. It should contain an IBAN account number.
Then I tried to analyze the set of account numbers (which the source guarantees to be good) with a Java library.
To keep the scope of the question narrow: I can't explain the following. The two strings below are different:
03069
03069
The first is a copy & paste from the Excel file, the second is handwritten. Google returns different results for "abi [above number]", and in fact in the second case I can find that it is the bank code for the Intesa Sanpaolo bank (the exact page displaying the ABI code, localized, is here).
So, to keep the scope narrow: how is that possible? Is it something to do with the encoding?
Try it yourself: press CTRL+F and type "030"; it will match both lines. Now type 6; it will match only the second line.
The same happens in Notepad++.
There's a U+200B ZERO WIDTH SPACE between 030 and 69 in the first string.
Paste the text into https://www.branah.com/unicode-converter, for example, or edit it in a hex-capable editor.
The solution for cleaning such strings could be, for example, to whitelist characters: replace everything that isn't A-Z or 0-9, so any stray invisible characters are scrubbed.
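A minimal sketch of that whitelist cleanup in Python (the sample string just simulates the hidden U+200B):

# Sketch: scrub everything that isn't a letter or digit, which also removes
# invisible characters such as U+200B ZERO WIDTH SPACE.
import re

raw = "030\u200b69"                      # what the Excel copy & paste produced
clean = re.sub(r"[^A-Za-z0-9]", "", raw)
print(clean)                             # -> 03069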
I am working with some clickstream data and I need to give the vendor specifications for a preferred format to be consumed by SSIS.
Since it's URL data in a text file, which column delimiter would you recommend? I was thinking of the pipe "|", but I realize that pipes can appear within a URL.
I did some testing specifying multiple characters as the delimiter, like |^|, but when creating a flat file connection there is no such option in SSIS, so I had to type the characters in. When I went to edit the flat file connection manager, it had changed to {|}^{|}. That made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas as to which would be a safe column delimiter to use.
Probably tab-delimited would be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
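To illustrate what "quotes the data appropriately" could look like on the supplier side, here is a hedged Python sketch of tab-delimited output with every field quoted (column names and URLs are made up):

# Sketch: tab-delimited file where every field is quoted, so a delimiter
# character embedded in a URL cannot break the columns.
import csv

rows = [
    {"user_id": "123", "url": "https://example.com/path?a=1|b=2"},
    {"user_id": "456", "url": "https://example.com/plain"},
]

with open("clickstream.tsv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(
        f, fieldnames=["user_id", "url"], delimiter="\t", quoting=csv.QUOTE_ALL
    )
    writer.writeheader()
    writer.writerows(rows)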