MapForce FlexText as target to make EDI fixed-length file

I need to use Altova MapForce and its FlexText-as-a-target feature to make a fixed-length file. I will be pulling data from two sources--one holds employees, the other holds dependents of the employees. The employees have a certain set of fixed-length fields, and the dependents have a different set of fixed-length fields. The dependents will be tied to employees based on the employee SSN--the dependent record will have the employee SSN in it as well.
The end result will be one .txt fixed length file where the employees and dependents will be "interweaved" in such a way that an employee's dependents appear immediately after the employee record.
FlexText seems pretty straightforward, but going backwards and using FlexText as a target is kind of blowing my mind.
Does anyone have some experience and can give me direction?
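For reference, and independent of MapForce, here is the interleaving the output needs, sketched in plain Python (the record layouts, field widths, and sample data are invented purely for illustration): group the dependents by the employee SSN they carry, then write each employee record followed immediately by that employee's dependent records.
    # Sketch only: group dependents by employee SSN, then emit fixed-width rows
    # with each employee immediately followed by its dependents.
    from collections import defaultdict

    employees = [
        {"ssn": "111223333", "name": "SMITH JOHN", "dob": "19800101"},
        {"ssn": "444556666", "name": "DOE JANE",   "dob": "19751231"},
    ]
    dependents = [
        {"emp_ssn": "111223333", "name": "SMITH SUSAN", "dob": "20100315"},
        {"emp_ssn": "111223333", "name": "SMITH TOM",   "dob": "20120720"},
    ]

    deps_by_ssn = defaultdict(list)
    for d in dependents:
        deps_by_ssn[d["emp_ssn"]].append(d)

    with open("enrollment.txt", "w") as out:
        for e in employees:
            # "E" record: 1 + 9 + 30 + 8 fixed-width columns (made-up widths)
            out.write("E" + e["ssn"].ljust(9) + e["name"].ljust(30) + e["dob"].ljust(8) + "\n")
            for d in deps_by_ssn[e["ssn"]]:
                # "D" record: dependent rows follow their employee
                out.write("D" + d["emp_ssn"].ljust(9) + d["name"].ljust(30) + d["dob"].ljust(8) + "\n")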

Related

How to decide if customer is a dimension

I am creating a data warehouse (data mart) for a project-based (labor-centric) organization. (That is, they sell labor-based "projects"; they don't sell physical products.) They are interested in project- and customer-related dimension info. I need to make a design decision about a certain dimension. Should I make this dimension be "Project" (with customer info as attributes on this dimension)? Or should I make two separate dimensions -- one for project and another for customer? What are some questions to ask (or things to think about) to help me make this decision?
If the customer and the project represent axes for analysis, you can proceed with the following design:
The customer and the project can be slowly changing dimensions (SCDs), where you decide on a type from the following:
SCD Type Summary
Type 1: Overwrite the changes.
Type 2: History is added as a new row.
The fact table can handle measures like the cost, the number of working days...
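As a rough illustration of that design (table and column names are assumptions, and the ValidFrom/ValidTo/IsCurrent bookkeeping only applies if you choose type 2), the star schema could be sketched like this:
    # Illustrative star schema: Customer and Project dimensions plus a fact
    # table carrying the measures mentioned above. Names are not a prescription.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE DimCustomer (
        CustomerSK   INTEGER PRIMARY KEY,  -- surrogate key
        CustomerID   TEXT,                 -- business key
        CustomerName TEXT,
        ValidFrom    TEXT, ValidTo TEXT, IsCurrent INTEGER  -- SCD type 2 bookkeeping
    );
    CREATE TABLE DimProject (
        ProjectSK   INTEGER PRIMARY KEY,
        ProjectID   TEXT,
        ProjectName TEXT,
        ValidFrom   TEXT, ValidTo TEXT, IsCurrent INTEGER
    );
    CREATE TABLE FactProjectLabor (
        CustomerSK  INTEGER REFERENCES DimCustomer(CustomerSK),
        ProjectSK   INTEGER REFERENCES DimProject(ProjectSK),
        DateKey     INTEGER,
        Cost        REAL,        -- measures: cost, number of working days
        WorkingDays INTEGER
    );
    """)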

Dimension with a surrogate key into itself (Data Warehouse)

I have an Employee dimension that I am using SCDs and Surrogate keys to track changes over time.
Employee's business system key: EmployeeID
Employee Surrogate key: EmployeeSCDKey
I would like to have Manager information tracked over time as well. The managers are employees like everyone else and as such, I was thinking about having a ManagerSCDKey column in my Employee dimension like so:
Example:
This is the problem I am facing, though. The arrow shows the boundary from one transform to the next. In the event that a Manager changes jobs (or some other type 2 SCD field changes) and a new surrogate key is created for them, that change won't be recognized until the next time the dimension is transformed.
By this I mean that the row in red won't appear until the second transformation, so any fact rows associated with Joe for this time will have outdated manager information.
I guess it boils down to this:
Is there a way to make this pattern work? (dimension with a key into itself?)
Or is there a better practice way to accomplish the same task? I would prefer to not maintain a manager dimension that is extremely similar to the employee dimension, but if that's best practice then so be it.
Here's a good discussion of some alternatives; I'm sure you'll find something that matches what you need: http://www.informationweek.com/software/information-management/kimball-university-five-alternatives-for-better-employee-dimension-modeling/d/d-id/1082326?page_number=1
I'd likely opt for some kind of 'reports to' bridge table, perhaps with natural keys rather than surrogate keys, depending on how you want it to behave (and to work around your type 2 SCD issue). You wouldn't need a separately created manager dimension; just have the employee dimension point to the bridge table twice.
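A minimal sketch of that bridge idea, assuming the simplified columns below (the real dimension would carry more). It stores natural keys, which is the variation mentioned above, so a manager's type 2 changes don't ripple into the bridge:
    # Sketch only: employee dimension plus a 'reports to' bridge keyed on the
    # natural EmployeeID. Both bridge columns point back at DimEmployee, so no
    # separate manager dimension is needed.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE DimEmployee (
        EmployeeSCDKey INTEGER PRIMARY KEY,  -- surrogate key
        EmployeeID     TEXT,                 -- business key
        Name           TEXT,
        JobTitle       TEXT,                 -- example type 2 attribute
        ValidFrom      TEXT, ValidTo TEXT, IsCurrent INTEGER
    );
    CREATE TABLE BridgeReportsTo (
        EmployeeID TEXT,   -- subordinate (natural key into DimEmployee)
        ManagerID  TEXT,   -- manager (natural key into DimEmployee)
        ValidFrom  TEXT,
        ValidTo    TEXT
    );
    """)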

When choosing an SCD type for a data warehouse, what things do you have to consider?

For each case, what are the considerations for the fact tables?
How do changes in the dimension affect the fact tables, and how are these handled in the fact tables?
The simplest part of the answer is about fact tables. There are no changes, regardless of the dimension type. This is because the relationship between the fact and dimension is the dimension's surrogate key.
For dimensions, you need to decide which columns can change, and whether you need to know their previous value.
If none of the columns can change, then SCD0 is usually the most appropriate. You'd use this for something like a calendar, perhaps, where the data is constant, unless we revert to medieval papism instead of an atomic clock :)
Sometimes you don't care about the previous value; only the current value is important, regardless of the age of the fact. An example here MIGHT be a customer's telephone number. I say "might" because that significance depends on business rules. These are SCD1 dimensions.
If we care about previous history, we need to make a choice between SCD2 and SCD3.
SCD2 creates a new row each time the dimension data changes. The business key remains constant, but facts relating to the new time period will have the new row surrogate key. An example might be Customer Address, where the customer is always identified by the business key C12345, but fact tables point to IDs 13, 987 and 2465 representing the changes in address as that customer shifted house, town, etc.
SCD3 maintains a "previous" value in the current row. If all we needed to know was the current value of a field and its previous value, we don't need to create a new row each time that value changes. Updating an SCD3 dimension needs to shift the "current" value to the "previous" value, then write the new value to the "current" value.
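To make the contrast concrete, here is a tiny illustration with made-up data, reusing the C12345 address example above for SCD2 and a hypothetical sales_rep column (not from the original) for SCD3:
    # SCD2: each change creates a new row with a new surrogate key; the business
    # key stays the same, so facts point at whichever row was current at the time.
    scd2_customer_address = [
        {"sk": 13,   "business_key": "C12345", "address": "12 Old St",  "current": False},
        {"sk": 987,  "business_key": "C12345", "address": "4 River Rd", "current": False},
        {"sk": 2465, "business_key": "C12345", "address": "9 Hill Ave", "current": True},
    ]

    # SCD3: the same row keeps only the current and previous values, so a change
    # is an in-place update rather than a new row.
    scd3_customer_row = {"sk": 13, "business_key": "C12345",
                         "current_sales_rep": "Alice", "previous_sales_rep": "Bob"}

    def scd3_update(row, new_value):
        # shift current -> previous, then write the new current value
        row["previous_sales_rep"] = row["current_sales_rep"]
        row["current_sales_rep"] = new_value
        return row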
Now, the terminology gets a little messy, because a dimension can actually combine all of these types in one. Consider a theoretical Bank Account dimension:
ID (surrogate key)
Number (business key)
Account Name (SCD1, depending on business rules)
Opening Balance (SCD0)
Authorised Signatory (SCD2, we want a record of who was authorised at a point in time)
Relationship Manager (SCD3, I want the current and prior)
The SCD type tells me what needs to be updated when any of these columns changes.
SCD0: This value should never change; no updates required.
SCD1: Update ALL rows for the business key.
SCD2: Create a new row whenever this value changes.
SCD3: Update the previous and current values on all rows for the business key.
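For illustration only, here is roughly what those updates look like against the bank account dimension above. The table layout, the IsCurrent flag, and the sample values are assumptions, and effective-date handling is omitted:
    # Sketch of the statement each SCD type implies when a column changes.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE DimBankAccount (
        ID INTEGER PRIMARY KEY,        -- surrogate key
        Number TEXT,                   -- business key
        AccountName TEXT,              -- SCD1
        OpeningBalance REAL,           -- SCD0: never updated
        AuthorisedSignatory TEXT,      -- SCD2
        RelationshipManager TEXT,      -- SCD3: current value
        PriorRelationshipManager TEXT, -- SCD3: previous value
        IsCurrent INTEGER)""")

    # SCD1: overwrite the attribute on ALL rows for the business key.
    con.execute("UPDATE DimBankAccount SET AccountName = ? WHERE Number = ?",
                ("New Name", "ACC-001"))

    # SCD2: expire the current row and insert a new one with a new surrogate key.
    con.execute("UPDATE DimBankAccount SET IsCurrent = 0 WHERE Number = ? AND IsCurrent = 1",
                ("ACC-001",))
    con.execute("""INSERT INTO DimBankAccount
                   (Number, AccountName, OpeningBalance, AuthorisedSignatory,
                    RelationshipManager, PriorRelationshipManager, IsCurrent)
                   VALUES (?, ?, ?, ?, ?, ?, 1)""",
                ("ACC-001", "New Name", 100.0, "New Signatory", "Same Mgr", None))

    # SCD3: shift current -> previous, then write the new current value in place.
    con.execute("""UPDATE DimBankAccount
                   SET PriorRelationshipManager = RelationshipManager,
                       RelationshipManager = ?
                   WHERE Number = ?""",
                ("New Manager", "ACC-001"))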
Kimball further defines SCD4-6, but these are much less commonly used. I won't go into the details, this answer is getting long enough :)
Finally, there is the issue of cardinality to consider. If your fact can be related to more than one dimension row at a time, then you might need a Bridge table to handle the relationship.
In summary:
Your fact tables contain foreign keys to dimension tables
Dimension rows are identified by surrogate keys
There may be many dimension rows for a given business key, depending on history requirements.

Delphi - What Structure allows for SAVING inverted index type of information?

Delphi XE6. I'm looking to implement a limited style of search: specifically, an edit field for the user to enter a business name which would get looked up. I need to allow the user to enter multiple words, or parts of multiple words. For example, for the business "First Bank of Kansas", the user should be able to enter "Fir Kan" and it should return a match. This means an inverted index type of structure. I have some type of list of each unique word, and for each word a list of IDs (document ID, primary key ID, etc., which is an integer). I am struggling with WHAT type of structure to use for this... I have approximately 250,000 business names, which contain 43,500 unique words. Word count will vary from 1 occurrence of a word to several thousand (company, corporation, etc.). I have some requirements...
1). Assume the user enters BAN. I need to find ALL words that start with BAN. I need to return BANK, BANKER, etc... This means that whatever structure I use, I have to be able to find BAN and then move to the next alphabetic entry... and keep moving to the next until I find a value that does NOT start with BAN. This eliminates any type of HASH structure, correct?
2). I obviously want this to be fast. HASH is the fastest, but I can't use this, correct? See requirement 1.
3). Each entry in this structure needs to be able to hold a list of integers. If I end up going with a LinkedList, then each element has to hold a list of Integers.
4). I need to be able to save and load this structure. I don't want to have to build it each time I use it.
Whatever I end up with, it appears to have to be a NESTED structure, a higher level list (LinkedList?) with each node being an Integer List.
What am I looking for? What do commercial products use? Outlook, etc. have search capabilities.
Every word is linked to a specific set of IDs, each representing a business name, right?
I recommend using a binary tree data structure, because the search effort is normally O(log n), which is quite fast. Especially if business names change at runtime, an AVL tree should do well, although it's quite some work to implement one yourself. But there are plenty of ready-to-use units for binary trees all over the internet.
For each word found in your tree data structure, take its list of IDs and aggregate them, grouped by the entered word they matched.
As the last step, take all those aggregated lists of IDs and intersect them.
Only IDs that fit all entered words will be left. Those IDs reference the business names being searched for.
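The same approach, sketched in Python rather than Delphi just to show the moving parts: a sorted word list stands in for the ordered tree (find the prefix, walk forward until words stop matching), each word maps to a set of business IDs, the per-term sets are intersected at the end, and pickle covers the save/load requirement.
    # Sketch only: inverted index with prefix lookup, intersection, and persistence.
    import bisect
    import pickle

    names = {1: "First Bank of Kansas", 2: "Kansas Farm Bureau", 3: "First National"}

    # Build the inverted index: word -> set of business-name IDs.
    index = {}
    for biz_id, name in names.items():
        for word in name.upper().split():
            index.setdefault(word, set()).add(biz_id)
    words = sorted(index)            # ordered structure enables prefix range scans

    def ids_for_prefix(prefix):
        """Union of ID sets for every indexed word starting with prefix."""
        prefix = prefix.upper()
        ids = set()
        i = bisect.bisect_left(words, prefix)
        while i < len(words) and words[i].startswith(prefix):
            ids |= index[words[i]]
            i += 1
        return ids

    def search(query):
        """Intersect the per-term ID sets, so 'Fir Kan' matches ID 1."""
        result = None
        for term in query.split():
            ids = ids_for_prefix(term)
            result = ids if result is None else result & ids
        return result or set()

    # Save/load so the index doesn't have to be rebuilt every run.
    with open("index.pkl", "wb") as f:
        pickle.dump((words, index), f)

    print(search("Fir Kan"))   # -> {1}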

Table Normalization with no Domain values

There is a debate between our ETL team and a Data Modeler on whether a table should be normalized or not, and I was hoping to get some perspective from the online community.
Currently the tables are set up as follows:
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
Both tables are only being populated by a periodic file (from a 3rd party) through an ETL job.
A single record in the file contains all attributes in both tables (for a single row).
The file populating these tables is a delta (only rows with some change in them are in the file).
One change to one attribute for one record (again, only by the 3rd party) will result in all the data for that record being in the file.
The domain values for Code and Name are not known.
Question: Should the LookupTable be denormalized into MainTable?
ETL team: Yes. With this setup, every row from the file will first have to check the 2nd table to see if their FK is in there (insert if it is not), then add the MainTable row. More Code, Worse Performance, and yes slightly more space. However, regardless of a change to a LookupTable.Name from a 3rd party, the periodic file will reflect every row affected, and we will still have to parse through each row. If lumped into MainTable, all it is is a simple update or insert.
Data Modeler: This is standard good database design.
Any thoughts?
Build prototypes. Make measurements.
You started with this, which your data modeler says is a standard good database design.
MainTable            LookupTable
  PrimaryKey (PK)      Code (PK)
  Code (FK)            Name
  OtherColumns
He's right. But this, too, is a good database design.
MainTable
  PrimaryKey (PK)
  Name
  OtherColumns
If all updates to these tables come only from the ETL job, you don't need to be terribly concerned about enforcing data integrity through foreign keys. The ETL job would add new names to the lookup table anyway, regardless of what their values happen to be. Data integrity depends mainly on the system the data is extracted from. (And the quality of the ETL job.)
With this setup, every row from the file will first have to check the 2nd table to see if their FK is in there (insert if it is not), then add the MainTable row.
If they're doing row-by-row processing, hire new ETL guys. Seriously.
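For illustration, a set-based load against the simplified tables from the question might look roughly like this (the staging table, sample rows, and upsert choice are assumptions): bulk-load the delta file into a staging table, add any missing codes to the lookup table in one statement, then load the main rows in one statement.
    # Sketch only: set-based ETL instead of row-by-row lookups.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE LookupTable (Code TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE MainTable  (PrimaryKey INTEGER PRIMARY KEY,
                             Code TEXT REFERENCES LookupTable(Code),
                             OtherColumns TEXT);
    CREATE TABLE Staging    (PrimaryKey INTEGER, Code TEXT, Name TEXT, OtherColumns TEXT);
    """)

    # Pretend the periodic delta file has already been bulk-loaded into Staging.
    con.executemany("INSERT INTO Staging VALUES (?, ?, ?, ?)",
                    [(1, "A", "Alpha", "x"), (2, "B", "Beta", "y"), (3, "A", "Alpha", "z")])

    # One statement adds any codes the lookup table hasn't seen yet...
    con.execute("""INSERT INTO LookupTable (Code, Name)
                   SELECT DISTINCT s.Code, s.Name FROM Staging s
                   WHERE s.Code NOT IN (SELECT Code FROM LookupTable)""")

    # ...and one statement (an upsert on the primary key) loads the main rows.
    con.execute("""INSERT OR REPLACE INTO MainTable (PrimaryKey, Code, OtherColumns)
                   SELECT PrimaryKey, Code, OtherColumns FROM Staging""")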
More Code, Worse Performance, and yes slightly more space.
They'll need a little more code to update two tables instead of one. How long does it take to write the SQL statements? How long to run them? (How long each way?)
Worse performance? Maybe. Maybe not. If you use a fixed-width code, like an integer or char(3), updates to the codes won't affect the width of the row. And since the codes are shorter than the names, more rows might fit in a page. (It doesn't make any sense to use a code that is longer than the name.) More rows per page usually means less I/O.
Less space, surely. Because you're storing a short code instead of a long name in every row of "MainTable".
For example, the average length of a country name is about 11.4 characters. If you used 3-character ISO country codes, you'd save an average of 8.4 bytes per row in "MainTable". For 100 million rows, you save about 840 million bytes. The size of that lookup table is negligible, about 6k.
And you don't usually need a join to get the full name; country codes are intended to be human-readable without expansion.
