Recommended column delimiter for clickstream URL data to be consumed by SSIS

I am working with some clickstream data, and I need to give the vendor specifications for a preferred format that SSIS can consume.
Since the text file contains URL data, which column delimiter would you recommend? I was thinking of a pipe "|", but I realize that pipes can appear within a URL.
I did some testing with a multi-character delimiter such as |^|, but the SSIS flat file connection manager does not offer that as an option, so I had to type the characters in. When I went back to edit the flat file connection manager, the delimiter had changed to {|}^{|}. That made me nervous, even though the import succeeded.
I just wanted to see if anybody has good ideas about which column delimiter would be safe to use.

Probably tab-delimited would be fairly safe, at least assuming that by "clickstream" you mean a list of URLs or something similar. But in theory any delimiter should be fine as long as the supplier quotes the data appropriately.
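
To make "quotes the data appropriately" concrete, here is a minimal Python sketch of a tab-delimited file with a double quote as the text qualifier. The sample rows, the function names and the choice of qualifier are assumptions for illustration only, not the vendor's actual format.

```python
import csv
import io

# Hypothetical clickstream rows: (timestamp, user id, URL).
# The second URL contains a pipe; the third contains a raw tab to exercise quoting.
rows = [
    ("2015-01-01T10:00:00", "u1", "http://example.com/a?x=1&y=2"),
    ("2015-01-01T10:00:05", "u2", "http://example.com/b?tags=red|blue"),
    ("2015-01-01T10:00:09", "u3", "http://example.com/c?q=a\tb"),
]

def write_tsv(records):
    """Write records as tab-delimited text, quoting only fields that need it."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", quotechar='"',
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\r\n")
    writer.writerows(records)
    return buf.getvalue()

def read_tsv(text):
    """Parse tab-delimited text back into tuples, honouring the quotes."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="\t", quotechar='"')
    return [tuple(r) for r in reader]

text = write_tsv(rows)
round_tripped = read_tsv(text)
```

The pipe in the second URL needs no special handling because it is not the delimiter, and the field containing a tab survives because the writer wraps it in quotes. On the SSIS side, the matching setting would be the text qualifier on the flat file connection manager.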

Related

How to change delimiter in the exported data in Hybris?

I am exporting data from SAP Hybris.
The data itself also contains semicolons (;).
In the exported data the delimiter is also a semicolon, which prevents me from splitting the data and doing my work. Is there a way to change this delimiter to something else?
I understand this can be achieved by changing the "csv.fieldseparator" property, but that would affect every export, and I can't afford that in production. Any suggestions would be appreciated.
Go to Backoffice.
Search for "export".
In the advanced configuration, set your new delimiter. By default, it is a semicolon (;).

Source data and target data mismatch: how can we insert special characters like 5±3°C and -

I loaded data into the target table, but the data does not come through exactly as in the source. Can you suggest a fix?
SOURCE DATA
ID4581 PEG-INTRON Provide stab data out to end of shelf-life 36 mths at 5±3°C as soon as data becomes avail for PEG Intron Pwdr for Inj vial btchs: 1-IQA-403, 1-IQJ-402 1-IQC-404
TARGET DATA
ID4581 PEG-INTRON Provide stab data out to end of shelf-life 36 mths at 5�3�C as soon as data becomes avail for PEG Intron Pwdr for Inj vial btchs: 1-IQA-403, 1-IQJ-402 1-IQC-404
How can I insert special characters?
Your special characters need a consistent encoding. In a current web-based system, UTF-8 is usually the encoding best suited to the characters you need.
So make sure everything is set to use UTF-8 and it will work.
That means your APIs, database definitions, database data, database connection, scripting, frontend and so on should all use UTF-8 systematically. If you depend on something like a third-party interface that is not defined in UTF-8, make sure to apply the appropriate translations on input and output.
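
A small Python sketch of how this kind of damage can arise. It assumes, purely for illustration, that the source wrote the text as Latin-1 and the target read it as UTF-8; the actual encodings in your pipeline may differ.

```python
source = "36 mths at 5±3°C"

# Assumption: the source system writes single-byte Latin-1 (ISO-8859-1).
latin1_bytes = source.encode("latin-1")

# Reading those bytes as UTF-8 fails on the bytes for ± and °.
# With errors="replace", each invalid byte becomes U+FFFD (shown as �).
mangled = latin1_bytes.decode("utf-8", errors="replace")

# Using the same encoding on both sides preserves the characters.
restored = source.encode("utf-8").decode("utf-8")

# If one side cannot be changed, translate at the boundary:
# decode with the encoding it really uses, then re-encode as UTF-8.
translated = latin1_bytes.decode("latin-1").encode("utf-8").decode("utf-8")
```

Here `mangled` has the same shape as the TARGET DATA above (`5�3�C`), while `restored` and `translated` both equal the source text.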

SSIS pipe delimiter issue for CRLF csv file

I am facing the pipe delimiter issue below in SSIS.
CRLF Pipe delimited text file:
-----------------------------
Col1|Col2 |Col3
1 |A/C No|2015
2 |A|C No|2016
Because of the pipe embedded within a field value, SSIS fails to read the data.
Bad news: once you have a file with this problem, there is NO standard way for ANY software program to correctly parse the file.
Good news: if you can control (or affect) the way the file is generated to begin with, you would usually address this problem by including what is called a "Text Delimiter" (for example, surrounding field values with double quotes; SSIS calls this the text qualifier) in addition to the Field Delimiter (pipe). The Text Delimiter helps because a program (like SSIS) can then tell the field values apart from the delimiters, even if the values contain the Field Delimiter (e.g. pipes).
If you can't control how the file is generated, the best you can usually do is GUESS, which is problematic for obvious reasons.
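
The difference can be sketched in Python. The sample below is a simplified version of the file in the question (padding spaces removed), and the quoted variant is an assumption about what the producer could emit instead.

```python
import csv
import io

# The file as described in the question: no text delimiter.
unquoted = "Col1|Col2|Col3\r\n1|A/C No|2015\r\n2|A|C No|2016\r\n"

# Hypothetical fixed output: values wrapped in double quotes.
quoted = 'Col1|Col2|Col3\r\n1|"A/C No"|2015\r\n2|"A|C No"|2016\r\n'

def parse(text):
    """Parse pipe-delimited text, treating double quotes as the text delimiter."""
    reader = csv.reader(io.StringIO(text, newline=""),
                        delimiter="|", quotechar='"')
    return list(reader)

unquoted_rows = parse(unquoted)
quoted_rows = parse(quoted)

# Without quotes, the last row splits into four fields instead of three,
# and nothing in the file says which pipe was data.
field_counts = [len(row) for row in unquoted_rows]  # [3, 3, 4]
```

Counting fields per row, as `field_counts` does, at least lets you detect the bad rows in an unquoted file and set them aside for manual review, even though it cannot repair them.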

Best practices for creating a CSV file?

I am working in Swift although perhaps the language is not as relevant, and I am creating a relatively simple CSV file.
I wanted to ask for some recommendations in creating the files, in particular:
Should I wrap each column/value in single or double quotes? Or nothing? I understand if I use quotes I'll need to escape them appropriately in case the text in my file legitimately has those values. Same for \r\n
Is it ok to end each line with \r\n ? Anything specific to Mac vs. Windows I need to think about?
What encoding should I use? I'd like to make sure my csv file can be read by most readers (so on mobile devices, mac, windows, etc.)
Any other recommendations / tips to make sure the quality of my CSV is ideal for most readers?
I have a couple of apps that create CSV files.
Any column value that contains a newline, the field separator, or the quote character itself must be enclosed in quotes (double quotes are common, single quotes less so), and any quote inside a quoted value is escaped by doubling it.
I end lines with just \n.
You may wish to give the user some options when creating the CSV file. Let them choose the field separator. While the comma is common, a tab is also common. You can also use a semi-colon, space, or other characters. Just be sure to properly quote values that contain the chosen field separator.
UTF-8 is arguably the best choice for encoding the file. It lets you support all Unicode characters, and just about any tool that supports CSV can handle UTF-8. It avoids any issues with platform-specific encodings. But again, depending on the needs of your users, you may wish to give them the choice of encoding.
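
The question is about Swift, but the rules are language-independent, so here is a short Python sketch of them. The sample records and the function names are made up for illustration.

```python
import csv
import io

# Hypothetical records containing a non-ASCII name, embedded quotes,
# an embedded separator and an embedded newline.
records = [
    ["id", "name", "note"],
    ["1", "Zoë", 'said "hi", then left'],
    ["2", "Ali", "line one\nline two"],
]

def to_csv_bytes(rows, delimiter=","):
    """Serialise rows with minimal quoting, \\n line endings, UTF-8 bytes."""
    buf = io.StringIO(newline="")
    writer = csv.writer(buf, delimiter=delimiter,
                        quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue().encode("utf-8")

def from_csv_bytes(data, delimiter=","):
    """Parse UTF-8 CSV bytes back into lists of strings."""
    text = data.decode("utf-8")
    return list(csv.reader(io.StringIO(text, newline=""), delimiter=delimiter))

data = to_csv_bytes(records)
```

Only the values that need it are quoted, and the embedded quotes come out doubled (`"said ""hi"", then left"`). Passing `delimiter=";"` or `delimiter="\t"` gives the user-selectable separator mentioned above without changing the quoting logic.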

Retrieve text content of as many file types as possible

I maintain a client-server DMS written in Delphi/SQL Server.
I would like to allow the users to search for a string inside all the documents stored in the DB. (Files are stored as blobs, zipped to save space.)
My idea is to index them on "checkin": when I store a new file, I extract all the text in it and put it in a new DB field. So my files table would be:
ID_FILE integer
ZIPPED_FILE blob
TEXT_CONTENT text field (nvarchar in sql server)
I would like to support "indexing" of at least the most common text-like files, such as pdf, txt, rtf, doc and docx, maybe adding xls, xlsx, ppt and pptx.
For MS Office files I can use ActiveX, since I already do that in my application, and for txt files I can simply read the file, but what about pdf and odt?
Could you suggest the best technique, or even a 3rd party component (not necessarily free), that parses with "no fear" all file types?
Thanks
Searching documents this way would be very slow and inconvenient to use. I'd advise you to create two additional tables instead of the TEXT_CONTENT field.
When you parse the text, you should extract the valuable words and try to standardise them so that you
- get rid of lower/upper case problems
- get rid of characters that might be used interchangeably.
e.g. in Turkish we have the ç character, which might be entered as c.
- get rid of words that are common in the language you are dealing with.
e.g. from "Thing I am looking for", only "Thing" and "Looking" might be of interest.
- get rid of whatever other problems you face.
Each word that already has an entry in the string_search table should reuse the ID it was already given there.
The records may look like this.
original_file_table
zip_id number
zip_file blob
string_search
str_id number
standardized_word text (or any string type with an appropriate secondary index)
file_string_reference
zip_id number
str_id number
I hope this gives you an idea of what I have in mind.
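
The three tables above could be sketched like this in Python. SQLite stands in for SQL Server here, and the stop-word list, the function names and the sample texts are all assumptions for illustration.

```python
import sqlite3
import unicodedata

# Assumption: a tiny English stop-word list; a real one is language-specific.
STOP_WORDS = {"i", "am", "for", "the", "a", "an", "is", "of", "to"}

def standardize(text):
    """Lowercase and strip diacritics, so that 'ç' and 'c' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def extract_words(text):
    """Return the set of standardized words, minus the stop words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in standardize(text))
    return set(cleaned.split()) - STOP_WORDS

def create_schema(conn):
    conn.executescript("""
        CREATE TABLE original_file_table (
            zip_id INTEGER PRIMARY KEY, zip_file BLOB);
        CREATE TABLE string_search (
            str_id INTEGER PRIMARY KEY, standardized_word TEXT UNIQUE);
        CREATE TABLE file_string_reference (
            zip_id INTEGER, str_id INTEGER, PRIMARY KEY (zip_id, str_id));
    """)

def index_file(conn, zip_id, zip_file, text):
    """Store the blob and link it to each of its words, reusing word IDs."""
    conn.execute("INSERT INTO original_file_table VALUES (?, ?)",
                 (zip_id, zip_file))
    for word in extract_words(text):
        # The UNIQUE constraint makes this a no-op for words already known.
        conn.execute("INSERT OR IGNORE INTO string_search (standardized_word) "
                     "VALUES (?)", (word,))
        (str_id,) = conn.execute(
            "SELECT str_id FROM string_search WHERE standardized_word = ?",
            (word,)).fetchone()
        conn.execute("INSERT OR IGNORE INTO file_string_reference VALUES (?, ?)",
                     (zip_id, str_id))

def search(conn, word):
    """Return the zip_ids of all files containing the given word."""
    rows = conn.execute(
        "SELECT r.zip_id FROM file_string_reference r "
        "JOIN string_search s ON s.str_id = r.str_id "
        "WHERE s.standardized_word = ? ORDER BY r.zip_id",
        (standardize(word),))
    return [zip_id for (zip_id,) in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
# The blobs are placeholder bytes; the text is whatever was extracted at checkin.
index_file(conn, 1, b"zipped-bytes-1", "Thing I am looking for")
index_file(conn, 2, b"zipped-bytes-2", "Another thing from Çanakkale")
```

With this data, `search(conn, "THING")` finds both files and `search(conn, "canakkale")` finds the second one, even though it was stored with a ç. A search is then an indexed lookup on string_search rather than a scan over every document's text.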
Your major problem is zipping your files before storing them as blobs in your database, which makes them unsearchable by the database itself. I would suggest the following.
Don't zip the files you put in the database. Disk space is cheap.
You can write a query like this, as long as you save the files in a text field:
Select * from MyFileTable Where MyFileData like '%Thing I am looking for%'
This is slow, but it works for file types that store their text as plain text (txt, rtf). Binary or compressed formats (pdf, doc, and the zip-based docx, xlsx and pptx) need their text extracted before a query like this can match.
The other alternative is to use an indexing engine such as Apache Lucene or Apache Solr, which will, as you put it, parse with "no fear" all file types.
