I need to write a small ETL pipeline to move some data from a source database to a target database (a data warehouse) so I can run some analysis on it.
Among that data, I need to clean and conform city names. Cities are entered manually by international users, so a single city can appear under multiple names (for example, London or Londra).
My source database contains not only big cities but also small villages.
If I do not standardize city names, our analysis could be nonsensical.
What is the best practice for standardizing city names in my target database? Do you have any ideas or suggestions?
Thank you
The only reliable way to do this is to use commercial address validation software - preferably in your source system when the data is being created but it could be integrated into your data pipeline processes.
Assuming you can't afford/justify the use of commercial software, the only other solution is to create your own translation table i.e. a table that holds the values that are entered and what value you want them to be translated to.
While you can build this table from historic data, there will always be new values that are not in it, so you would need a process to identify these, add the new records to your translation data, and then fix the affected rows. You would also have to accept that there will be un-cleansed data in your warehouse for a period of time after each data load.
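A minimal sketch of what such a translation table and its use during the load might look like (all table and column names here are invented for illustration):

    -- Illustrative translation table: maps raw, user-entered city names to a standard name
    CREATE TABLE city_translation (
        raw_name      VARCHAR(100) PRIMARY KEY,
        standard_name VARCHAR(100) NOT NULL
    );

    INSERT INTO city_translation (raw_name, standard_name) VALUES ('Londra', 'London');
    INSERT INTO city_translation (raw_name, standard_name) VALUES ('London', 'London');

    -- Apply the translation during the load; names with no match fall through unchanged
    -- and are flagged so they can be added to the translation table later
    SELECT s.customer_id,
           COALESCE(t.standard_name, s.city) AS city_standardized,
           CASE WHEN t.raw_name IS NULL THEN 1 ELSE 0 END AS needs_review
    FROM   source_customers s
    LEFT JOIN city_translation t ON t.raw_name = s.city;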
I am using a time series database (InfluxDB) and I am trying to understand how to design a measurement (table).
My background is using relational database where it is common to join tables.
In my current project we are writing different sensor values (like temperature and pressure) for many vehicles to a measurement, along with associated identifiers, so that we know the specific details of each value we measure.
Measurement: Sensor_Trans
Tags: time, vehicleId, sensorId
Fields: value (temperature or pressure)
Later, when I want to use these values, I need additional details about the specific values.
Note that I currently have 20+ unique tags for each sensor measurement, like:
unit of measure, size of vehicle, sensor description, etc.
For example: I want to know the engine pressure in kPa for all cars with four doors.
For example: I want to know the exhaust temperature in degrees C for truck 89.
I'd like to know what is considered best practice when designing time series measurements (tables).
1- Do I add more tags that provide the additional information directly to the measurement?
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code?
3- Other?
1- Do I add more tags that provide the additional information directly to the measurement? Yes, you can do that, but keep in mind that adding more tags also consumes more memory. Please refer to the hardware sizing guidance at the following link:
https://docs.influxdata.com/influxdb/v1.7/guides/hardware_sizing/
2- Do I keep the Vehicle and Sensor definitions in a relational table and join in code? There is no need to if you implement the above; alternatively, you could design a relational DB table for your entire need instead of keeping two different databases.
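To make the trade-off concrete, here is a rough sketch of both options, reusing the measurement, field, and tag names from the question; everything else (tag names like unit and doors, the lookup table layout) is hypothetical, not a definitive design.

Option 1, extra tags written with every point and queried directly in InfluxQL:

    SELECT "value" FROM "Sensor_Trans"
    WHERE "sensorId" = 'engine_pressure' AND "unit" = 'kPa' AND "doors" = '4'

Option 2, descriptive attributes kept in a relational lookup table and joined to the InfluxDB results in application code:

    CREATE TABLE vehicle (
        vehicleId   VARCHAR(20) PRIMARY KEY,
        vehicleType VARCHAR(20),  -- e.g. 'car' or 'truck'
        doors       INT
    );
    -- query InfluxDB by vehicleId only, then look up the descriptive attributes here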
I am reading the book "Modeling the Agile Data Warehouse with Data Vault" by H. Hultgren. He states:
EDW represents what did happen - not what should have happened
When are the cleaning and possible transformations performed? By transforming I mean standardization of the values (for example, a sex column can contain only two possible values, 'f' and 'm', and not 'female' or 'male' or 0 or 1).
If you are importing data through ETL, that is one place to do it. Or you can use some other kind of data cleansing tool. This is a very general question. It depends on the architecture of your data warehouse.
For example you might have a data warehouse that loads data and tries to automatically clean it or you might have an architecture where every single 'bad' record goes to an approval area to be cleaned by a person. I can assure you in the real world, no business user wants to have to pick from 6 values for gender.
The other thing is you might be loading data from three different systems, and these three different representations are completely valid in each system, but an end user doesn't want to have to pick from 6 choices - they want the data to be cleansed.
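As a small illustration of that kind of cleansing rule applied in the ETL (the table and column names are made up, and the 0/1 mapping is an assumption about the source encoding):

    -- Illustrative standardization of a sex column during the load
    SELECT person_id,
           CASE
               WHEN LOWER(sex_raw) IN ('f', 'female', '0') THEN 'f'  -- assumes 0 = female in the source
               WHEN LOWER(sex_raw) IN ('m', 'male', '1')   THEN 'm'  -- assumes 1 = male in the source
               ELSE NULL  -- anything else goes to a review/approval area
           END AS sex
    FROM   staging_persons;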
I'm thinking maybe this statement
EDW represents what did happen - not what should have happened
is a data vault specific thing, since DV is all about modelling and storing the source system data no matter how the schema changes. I guess in this case you would treat the data vault as an ODS and preserve the data as-is, then cleanse it on the way into the reporting star schema.
Time-series data such as historical stock prices are usually stored in an RDBMS.
I am evaluating various options to use this data, possibly store it in doc store or triple store in MarkLogic, and build some use cases on this data and/or along with the other kind of data stored in the doc/triple store.
Essentially, I am looking for:
Ways to store time series data such as historical stock prices in a MarkLogic database.
Ways to query this data (stored in ML or queried across the RDBMS), through XQuery for example.
Ways to query this data along with the other data stored in the doc/triple store.
I would appreciate any recommendations in this regard.
Added some more info...
I am trying to figure out a neat way of capturing this data as triples. The idea is that it would be nice to link this data with other related data. For example, if the historical stock price we are trying to store is for HSBC listed on NYSE, then we can in some way define resources for HSBC and NYSE, capture the stock price as literals (perhaps), and then link the HSBC resource with, for example, the company information stored in DBpedia.
Essentially, I am talking about creating linked data, so that it is easy to query across data fetched from different sources and also, if possible, to use inferencing. For example, with this approach it would be possible for me to run a query such as 'Get me the stock price of the companies headquartered in London whose turnover is greater than $1 billion'.
You have two alternatives: either you have one big document for each series, or you have one document per price. The former is not recommended, as the latter lets you make better use of the index system, especially a range index on the timestamp.
I worked on a system using MarkLogic, which was essentially a system to store time series. We used 1 document per point in the series (as well as 1 document for the series itself, for its "metadata", all information common across all the points in the series). We also put all documents relative to 1 series in 1 collection. We used a naming scheme for the document URIs based on the timestamp and a unique ID per series, so we can easily guarantee the uniqueness of the document URIs.
An important point is to have the series point documents reference their series document (either explicitly or just by being in the same collection), instead of the other way around.
As for querying, it depends on your specific use cases, but typically you will use a search constraint on the collection to identify one (or several) series, and a range index on the timestamp to select a "slice" of points in the series. If you have use cases like selecting points based on their value (instead of their time), you can do that just as efficiently as you do it based on the timestamp, by using a range index on the values themselves.
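To make that layout concrete, here is a minimal sketch; the collection name, URI scheme, element names, and index choices are all hypothetical, not taken from the project described above:

    Collection:   /series/HSBC-NYSE
    Series doc:   /series/HSBC-NYSE/metadata.xml    (ticker, exchange, currency, ...)
    Point doc:    /series/HSBC-NYSE/2014-03-17T16:30:00Z.xml

    <point>
      <series>HSBC-NYSE</series>
      <timestamp>2014-03-17T16:30:00Z</timestamp>
      <price>52.17</price>
    </point>

    Range indexes: element timestamp (xs:dateTime) and element price (xs:decimal)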
I would recommend storing time-series data in a time-series database: https://en.wikipedia.org/wiki/Time_series_database
Update 1:
You can define HSBC as an entity, specify meta-data for the entity such as location or headcount, and then store quarterly revenue and traded tick prices as separate time series. Then you can run queries that (a) filter by a meta-data tag such as Location and (b) aggregate, e.g. MAX(price). Actually, I would store headcount as a series as well. This way I can investigate correlations between different series for research and analytics.
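A generic, SQL-flavoured sketch of that entity / series split and the kind of query it enables (the schema is invented for illustration and not tied to any particular time-series product):

    -- Entity meta-data kept separately from the series
    CREATE TABLE entity (
        entity_id INT PRIMARY KEY,
        name      VARCHAR(50),    -- e.g. 'HSBC'
        location  VARCHAR(50),    -- meta-data tag, e.g. 'London'
        headcount INT
    );

    CREATE TABLE price_series (
        entity_id INT REFERENCES entity (entity_id),
        ts        TIMESTAMP,
        price     DECIMAL(18, 4),
        PRIMARY KEY (entity_id, ts)
    );

    -- Max traded price for entities headquartered in London
    SELECT e.name, MAX(p.price) AS max_price
    FROM   entity e
    JOIN   price_series p ON p.entity_id = e.entity_id
    WHERE  e.location = 'London'
    GROUP  BY e.name;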
There is a Java Swing application which uses an Informix database. I have user rights for the Swing application (i.e. no access to its source code), and read-only access to a mirror of the database.
Sometimes I need to find the database column that is backing a GUI element (TextBox, TableField, Label...). What would be the best approach to find out which database column and table hold the data shown, e.g., in a TextBox?
My general approach is to capture the state of the database. Commit a change using the GUI and then capture the state of the database again. Then I need to examine the difference. I've already tried:
Use the nrows field of systables: didn't work, because the number in nrows does not seem to be a real-time representation of the row count.
Create a script with SELECT COUNT(*) ... for all tables: didn't work because there are too many tables (> 5000). I also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check whether it suits your needs.
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty, at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables contain longer character strings, and they'd be the primary targets of your searches. If the schema is badly designed, with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
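One way to build that first pass over the system catalog (a sketch assuming the usual Informix catalog layout; the coltype codes below cover the standard character types, but verify them against your version's documentation):

    -- Candidate tables/columns: non-empty user tables with character columns
    SELECT t.tabname, c.colname, c.collength
    FROM   systables  t
    JOIN   syscolumns c ON c.tabid = t.tabid
    WHERE  t.tabid >= 100                          -- user tables only
    AND    t.tabtype = 'T'                         -- tables, not views
    AND    t.nrows > 0                             -- nrows is only approximate
    AND    MOD(c.coltype, 256) IN (0, 13, 15, 16)  -- CHAR, VARCHAR, NCHAR, NVARCHAR
    ORDER  BY t.tabname, c.colname;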
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).
I have an OLTP database, and am currently creating a data warehouse. There is a dimension table in the DW (DimStudents) that contains student data such as address details, email, notification settings.
In the OLTP database, this data is spread across several tables (as it is a standard OLTP database in 3rd normal form).
There are currently 10,390 records but this figure is expected to grow.
I want to use Type 2 slowly changing dimensions, whereby if a record has changed in the OLTP database, a new record is added to the DW.
What is the best way to scan through 10,000 records in the DW and then compare the results with the results in several tables contained in the OLTP?
I'm thinking of creating a "snapshot" using a temporary table of the OLTP data and then comparing the results row by row with the data in the Dimension table in the DW.
I'm using SQL Server 2005. This doesn't seem like the most efficient way. Are there alternatives?
Introduce LastUpdated into source system (OLTP) tables. This way you have less to extract using:
WHERE LastUpdated >= some_time_here
You seem to be using SQL Server, so you may also try the rowversion type (an 8-byte, database-scope-unique counter).
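A minimal sketch of that incremental extract using rowversion (the table and column names are invented; in practice the watermark would be read from and written back to an ETL control table between runs):

    -- One-time change to the source table
    ALTER TABLE dbo.Students ADD RowVer rowversion;

    -- Per load: pull only the rows changed since the previous run's watermark
    DECLARE @LastWatermark binary(8);
    SET @LastWatermark = 0x0000000000000000;   -- in practice, read from an ETL control table

    SELECT StudentId, FirstName, Email, RowVer
    FROM   dbo.Students
    WHERE  RowVer > @LastWatermark;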
When importing your data into the DW, use an ETL tool (SSIS, Pentaho, Talend). They all have a component (block, transformation) to handle SCD2 (slowly changing dimension type 2). For an SSIS example, see here. The transformation does exactly what you are trying to do; all you have to do is specify which columns to monitor and what to do when it detects a change.
It sounds like you are approaching this sort of backwards. The typical way of performing ETL (Extract, Transform, Load) is:
"Extract" the data from your OLTP database
"Transform" the extracted data by comparing it against the dimensional data to determine whether there are changes, plus whatever other validation needs to be performed
"Load" the data into your dimension table.
Effectively, in step #1 you'll create a physical record via a query against the multiple tables in your OLTP database, then compare that resulting record against your dimensional data to determine whether a modification was made. This is the standard way of doing things. In addition, 10,000 rows is pretty insignificant as far as volume goes; any RDBMS and ETL process should be able to work through that in no more than a few seconds. I know SQL Server has DTS (renamed to SSIS / Integration Services in more recent versions). That is the perfect tool for doing something like this.
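For reference, a bare-bones sketch of that compare-and-load step as a Type 2 update/insert in plain T-SQL (the staging schema and all table/column names are invented; NULL handling and the remaining attributes are omitted):

    -- 1. Close the current dimension row where a tracked attribute changed
    UPDATE d
    SET    EffectiveTo = GETDATE(), IsCurrent = 0
    FROM   dbo.DimStudents d
    JOIN   staging.Students s ON s.StudentId = d.StudentId
    WHERE  d.IsCurrent = 1
    AND   (d.Address <> s.Address OR d.Email <> s.Email);

    -- 2. Insert a new current row for changed and brand-new students
    INSERT INTO dbo.DimStudents (StudentId, Address, Email, EffectiveFrom, IsCurrent)
    SELECT s.StudentId, s.Address, s.Email, GETDATE(), 1
    FROM   staging.Students s
    LEFT JOIN dbo.DimStudents d ON d.StudentId = s.StudentId AND d.IsCurrent = 1
    WHERE  d.StudentId IS NULL;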
Does your OLTP database have an audit trail?
If so, then you can query the audit trail for just the records that have been touched since the last ETL.