SSIS: Can't handle line-feeds in CSV (Column delimiter not found) - character-encoding

I have some CSV files that appear OK in Notepad and Excel, but seem to have extra line-feeds in them when I view them in VS2010 or Notepad++. When I attempt to process them in SSIS, the files fail with errors like this:
Error: 0xC0202055 at Merge Files, Interface [225]: The column delimiter for column "Column 48" was not found.
Here's a truncated example (there are about 50 columns, and the line-wrap appears to happen at random, though at roughly the same position in each row):
The questions are: how do Notepad and Excel open these files OK (and seemingly ignore the extra line-feeds)? Is there a way to get SSIS to process these files? Could it be an SSIS setting, such as the code page?

For me, opening the file in Excel, saving it as an Excel file (.xlsx, though I am sure the old .xls format would work fine too), and then using the Excel Source in SSIS enabled me to load a file with this kind of problem into a SQL table.
Obviously this would not work if you need to load this kind of file regularly, or if there were many of these files. In that case the first answer would be better.

The easiest solution for us was to stage the input into a SQL table, and then, in a subsequent data flow, query it back out with the line-feeds stripped from the CSV output, e.g.
SELECT COLUMN1
,REPLACE(REPLACE([COLUMN2],CHAR(10),''),CHAR(13),'') AS [COLUMN2]
FROM TABLE
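If staging the data in SQL is not an option, the file can also be cleaned before SSIS ever reads it. Below is a minimal pre-processing sketch (file names are placeholders; it assumes the real row delimiter is CRLF and the stray breaks are bare LFs):
# Remove bare line-feeds (LF not preceded by CR) so only the real CRLF
# row delimiters survive. Run this before the SSIS flat-file source.
import re

with open("input.csv", "rb") as src:
    data = src.read()

# Delete every \n that is not part of a \r\n pair.
cleaned = re.sub(rb"(?<!\r)\n", b"", data)

with open("cleaned.csv", "wb") as dst:
    dst.write(cleaned)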

Related

How do I make Cypher respect character encoding when using LOAD CSV in browser?

My case: a list of Danish-named students (with names including characters such as ü, æ, ø, å). Minimal working example:
CSV file:
Fornavn;Efternavn;Mobil;Adresse
Øjvind;Ørnenæb;87654321;Paradisæblevej 125, 5610 Åkirkeby
Süzette;Ågård;12345678;Ærøvej 123, 2000 Frederiksberg
In-browser neo4j-editor:
$ LOAD CSV WITH HEADERS FROM 'file:///path/to/file.csv' AS line FIELDTERMINATOR ";"
CREATE (:Elev {fornavn: line.Fornavn, efternavn: line.Efternavn, mobil: line.Mobil, adresse: line.Adresse})
Resulting in registrations like:
Neo4j browser screenshot, containing ?-characters where Danish/German characters are wanted. My data come from a Learning Management System into Excel. When exporting as CSV from Excel, I can control the file encoding via the Save As dialogue box. I have tried encoding from Excel as "UTF-8" (which the Neo4j manual says it wants), "ISO Western European", "Windows Western European" and "Unicode", each in a separately named file, and adjusted the FROM 'file:///path/to/file.csv' clause accordingly.
Intriguingly, exactly the same misrepresentation results, regardless of which (apparent?) file encoding I request from Excel when saving. When copy-pasting the names and addresses directly into the editor, I do not encounter the problem.
Check Michael Hunger's blog post here, which contains some tips, namely:
if you use non-ascii characters (umlauts, accents etc.) make sure to use the appropriate locale or provide the System property -Dfile.encoding=UTF8
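One way to rule out the Excel export as the culprit is to re-encode the file explicitly before loading it. A minimal sketch, assuming the export is actually Windows-1252 (cp1252), as Excel CSVs on a Western-European system often are; the paths are placeholders:
# Re-encode an Excel CSV export to the UTF-8 that LOAD CSV expects.
with open("file.csv", "r", encoding="cp1252") as src:
    text = src.read()

with open("file_utf8.csv", "w", encoding="utf-8") as dst:
    dst.write(text)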

Gettext - Detecting duplicate messages with different variable key names

I have recently started internationalizing my Django project, and I have .po files. However, in my templates I have done things suboptimally: I have just copied the local variable name for something that appears a lot. So I have near-duplicates in the .po file, like %(num)s messages, %(num_messages)s messages and %d messages. I should have written them all the same way, so that there is only one translation.
Is there any way/software that can read my .po file and tell me these messages that I should merge?
I am afraid I do not know of any tool that has such functionality built in. What you could try instead is to use your favourite regex-aware text editor and Excel:
1) Paste the content of your po file into column A of a new spreadsheet.
2) Open the content of your po file in your favourite regex-aware text editor and try to reduce all long variables to their shortest variant: in your example you could replace %\([^\)]+\)s with %d. Or replace all variables with some string that does not occur anywhere else, like RORYS_PLACEHOLDER.
3) Paste the content of your po file, with normalized (or no) variables, into column B of the spreadsheet.
4) Set a filter for strings that start with msgid, then let Excel highlight duplicate values in column B (Home > Conditional Formatting > Highlight Cell Rules > Duplicate Values in Excel 2013).
Of course your po file may be too complex for this approach, but it is worth a try. If you would rather script the comparison, a rough sketch follows.
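The sketch below assumes the third-party polib package (pip install polib) and a file named django.po (a placeholder name); it normalises every %(name)s / %s / %d placeholder and reports msgids that differ only in their placeholder names:
import re
from collections import defaultdict

import polib  # third-party: pip install polib

PLACEHOLDER = re.compile(r"%\([^)]+\)[sd]|%[sd]")

po = polib.pofile("django.po")
groups = defaultdict(list)
for entry in po:
    # Replace every placeholder with the same token before comparing.
    normalized = PLACEHOLDER.sub("%VAR%", entry.msgid)
    groups[normalized].append(entry.msgid)

for normalized, msgids in groups.items():
    if len(msgids) > 1:
        print("Possible duplicates:", ", ".join(msgids))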

retrieve txt content of as many file types as possible

I maintain a client-server DMS written in Delphi/SQL Server.
I would like to allow users to search for a string inside all the documents stored in the db (the files are stored as blobs, zipped to save space).
My idea is to index them on check-in, so as I store a new file I extract all the text information in it and put it in a new DB field. So somehow my files table will be:
ID_FILE integer
ZIPPED_FILE blob
TEXT_CONTENT text field (nvarchar in sql server)
I would like to support "indexing" of at least the most common text-like files, such as: pdf, txt, rtf, doc, docx, and maybe also xls, xlsx, ppt and pptx.
For MS Office files I can use ActiveX, since I already do that in my application, and for txt files I can simply read the file, but what about pdf and odt?
Could you suggest the best technique, or even a 3rd-party component (paid is fine too), that parses all file types "with no fear"?
Thanks
Searching documents this way would lead to something very slow and inconvenient to use; I'd advise you to produce two additional tables instead of the TEXT_CONTENT field.
When you parse the text, you should extract the valuable words and try to standardise them, so that you
- get rid of lower/upper case problems;
- get rid of characters that might be used interchangeably, e.g. in Turkish we have the ç character, which might be entered as c;
- get rid of verbs and other words that are common in the language you are dealing with, e.g. from "Thing I am looking for", only "Thing" and "looking" might be of interest to you;
- get rid of whatever other problems you face.
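A rough sketch of that standardisation step (the character map and stop-word list are illustrative only, not exhaustive):
# Lower-case, fold interchangeable characters, and drop common words.
STOPWORDS = {"i", "am", "is", "the", "for", "a", "an", "of"}
CHAR_MAP = str.maketrans({"ç": "c", "ğ": "g", "ı": "i", "ö": "o", "ş": "s", "ü": "u"})

def standardize(text):
    words = text.lower().translate(CHAR_MAP).split()
    return [w for w in words if w not in STOPWORDS]

print(standardize("Thing I am looking for"))  # ['thing', 'looking']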
Each word that already has an entry in the string_search table should re-use the ID it was given there.
The tables may look like this:
original_file_table
zip_id number
zip_file blob
string_search
str_id number
standardized_word text (or any string type with an appropriate secondary index)
file_string_reference
zip_id number
str_id number
I hope this gives you an idea of what I am thinking of.
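To make the outline above concrete, here is a minimal sketch of the three tables and a lookup query, using SQLite from Python purely for illustration; the same shape carries over to SQL Server:
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE original_file_table (
    zip_id   INTEGER PRIMARY KEY,
    zip_file BLOB
);
CREATE TABLE string_search (
    str_id            INTEGER PRIMARY KEY,
    standardized_word TEXT UNIQUE  -- the UNIQUE constraint provides the secondary index
);
CREATE TABLE file_string_reference (
    zip_id INTEGER REFERENCES original_file_table(zip_id),
    str_id INTEGER REFERENCES string_search(str_id)
);
""")

# Find every file that contains a given standardised word
# (no data is inserted in this sketch, so the result is empty).
rows = con.execute("""
    SELECT fsr.zip_id
    FROM string_search s
    JOIN file_string_reference fsr ON fsr.str_id = s.str_id
    WHERE s.standardized_word = ?
""", ("thing",)).fetchall()
print(rows)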
Your major problem is zipping your files before putting them as blobs in your database, which makes them unsearchable by the database itself. I would suggest the following:
Don't zip files you put in the database. Disk space is cheap.
You can write a query like this as long as you save the files in a text field.
Select * from MyFileTable Where MyFileData like '%Thing I am looking for%'
This is slow, but it will work, because the text in most of those file types is plain text rather than binary (though some of the newer file types are now binary).
The other alternative is to use an indexing engine such as Apache Lucene or Apache Solr, which will, as you put it,
parse all file types "with no fear".

Recommended column delimiter for clickstream data to be consumed by SSIS

I am working with some clickstream data and I need to give the vendor specifications regarding a preferred format to be consumed by SSIS.
As it's URL data in a text file, which column delimiter would you recommend? I was thinking pipe "|", but I realize that pipes can be used within a URL.
I did some testing with specifying multiple characters as the delimiter, like |^|, but when creating a flat file connection there is no such option in SSIS; I had to type the characters in. And when I went back to edit the flat file connection manager, it had changed to {|}^{|}. It just made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas as to what would be a safe column delimiter to use.
Probably tab-delimited would be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
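As a quick illustration of the "quotes the data appropriately" point, Python's csv module, for example, quotes any field that contains the chosen delimiter, so even a URL with an embedded tab stays in one column (the URLs below are made up):
import csv
import io

rows = [
    ["2024-01-01T12:00:00", "https://example.com/search?q=a|b", "200"],
    ["2024-01-01T12:00:05", "https://example.com/page\twith-tab", "404"],
]

buf = io.StringIO()
writer = csv.writer(buf, delimiter="\t", quoting=csv.QUOTE_MINIMAL)
writer.writerows(rows)
print(buf.getvalue())  # the second URL is quoted because it contains a tab
On the SSIS side, the flat file connection manager's Text qualifier would then need to be set to the double quote so that the quoted fields are honoured.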

Loss of white space when saving HTML table as Excel

I've successfully got my web application to export Excel files by creating HTML tables and returning them with an Excel content type.
If I open the file up in Excel, and save it, the markup changes from:
<td>One Two</td>
to...
<td>One
Two</td>
It seems like Excel is wrapping the lines with \r\n, but not putting a space in there. Excel renders the cell with white space, but when I read it with OLEDB, there is no white space.
Can this be resolved by reading it differently, or exporting it differently, or perhaps by adding some MS-specific CSS?
