Finding Missing Values in datasets - machine-learning

I just want your help with an issue: how do I find out whether there are missing values, especially in big data sets, i.e. which columns have missing values and which do not?

This depends entirely on how the dataset is stored (if it's at rest as a disk file) or on which interface it is accessible through (SQL, graph query, etc.).
If it's a "plain file" like CSV, HDF, or an Octave/Matlab matrix, then use whatever scripting tool you're comfortable with to iterate over the rows and check for missing values. If it's an SQL dump, you can load it into SQLite or SQL Server and select for missing values. You could even use an SQL parser to report missing values directly from the dump, since there's really no need to persist it into a database.
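For the SQLite route, a minimal sketch of counting NULLs per column might look like this (the database file name and the table name train are placeholders for whatever your dump loads into):

import sqlite3

conn = sqlite3.connect("dump.db")                      # database built from the SQL dump
cur = conn.cursor()
columns = [row[1] for row in cur.execute("PRAGMA table_info(train)")]
for col in columns:
    missing = cur.execute(f'SELECT COUNT(*) FROM train WHERE "{col}" IS NULL').fetchone()[0]
    if missing:
        print(col, missing)                            # only report columns that have missing values
conn.close()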
If it's live data behind an API, you can use the API to query the data for missing values, if the API supports such queries. Otherwise, use the API to export (dump) the entire dataset and query it at rest as in the preceding paragraph. If the dataset doesn't have indices that allow finding missing data, expect the query to take a long time and possibly to impact the performance of the service that provides the data. Act with care and understand the exact consequences of what you're about to do.
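If the export ends up on disk as a large CSV, a rough sketch along these lines (pandas, with a hypothetical file name, reading in chunks so the whole file never has to fit in memory) reports which columns contain missing values:

import pandas as pd

missing_per_column = None
for chunk in pd.read_csv("export.csv", chunksize=100_000):      # process the file piece by piece
    counts = chunk.isnull().sum()
    missing_per_column = counts if missing_per_column is None else missing_per_column + counts

print(missing_per_column[missing_per_column > 0])               # columns with at least one missing value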

This gives the number of missing values in each column. Use your own pandas DataFrame in place of train.
train.isnull().sum()
Alternatively, you can use train.info() or train.describe() for a fuller summary of the data; the per-column counts they report also reveal which columns have missing values.

Number of missing values for the entire dataset:
df.isnull().sum().sum()
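If you only want the offending columns, and how bad they are as a percentage, something like the following works on any DataFrame (df is a placeholder for yours):

missing = df.isnull().sum()
percent = 100 * missing / len(df)
print(percent[missing > 0].sort_values(ascending=False))   # worst columns first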

Related

Find changes quickly in larger SQL database?

There is a Java Swing application which uses an Informix database. I have user rights granted for the Swing application (i.e. no source code), and read-only access to a mirror of the database.
Sometimes I need to find the database column which is backing a GUI element (TextBox, TableField, Label...). What would be the best approach to find out which database table and column hold the data shown, e.g., in a TextBox?
My general approach is to capture the state of the database, commit a change using the GUI, capture the state of the database again, and then examine the difference. I've already tried:
Using the nrows field of systables: didn't work, because the number in nrows does not seem to be a real-time representation of the row count.
Creating a script with SELECT COUNT(*) ... for all tables: didn't work because there are too many tables (> 5000). I also tried to optimize by removing empty tables, but there are still too many left.
Is there a simple solution that I'm missing?
Please look at the Change Data Capture API and check whether it suits your needs.
There probably isn't a simple solution.
You probably need to build yourself a map of the database, or a data dictionary for it. It sounds as though you can eliminate many of the tables from consideration since they're empty, at least for a preliminary pass. If you're dealing with information in a text box, the chances are it is some sort of character data; you can analyze which (non-empty) tables contain longer character strings, and those would be the primary targets of your searches. If the schema is badly designed, with lots of VARCHAR(255) columns even though the columns normally only hold short strings, life is more difficult. Over time, you can begin to classify tables and columns so that you end up knowing where to look for parts of the application.
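As a very rough sketch of that preliminary pass, assuming the mirror is reachable over ODBC (pyodbc here, with a placeholder DSN) and that the catalog tables systables and syscolumns are readable, you could list non-empty tables that have wider character columns and then search them for the exact text you see in the GUI. The coltype codes and the length threshold are assumptions to verify against your Informix version's documentation:

import pyodbc

conn = pyodbc.connect("DSN=informix_mirror")   # read-only connection to the mirror (placeholder DSN)
cur = conn.cursor()

# CHAR is commonly coltype 0 and VARCHAR 13 in the Informix catalogs; nullable columns add 256.
# nrows is only a statistics-based estimate and collength is a rough proxy for width,
# so treat both filters as heuristics for narrowing the search.
cur.execute("""
    SELECT t.tabname, c.colname
    FROM systables t JOIN syscolumns c ON c.tabid = t.tabid
    WHERE t.tabtype = 'T'
      AND t.nrows > 0
      AND MOD(c.coltype, 256) IN (0, 13)
      AND c.collength >= 20
""")
candidates = cur.fetchall()

needle = "text shown in the GUI"               # the value you typed into the TextBox
for tabname, colname in candidates:
    table, column = tabname.strip(), colname.strip()
    cur.execute(f"SELECT FIRST 1 {column} FROM {table} WHERE {column} = ?", needle)
    if cur.fetchone():
        print(table, column)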
One problem to beware of: the tabid in informix.systables isn't necessarily as stable as you'd like. Your data dictionary needs to record its own dd_tabid for the table it describes, and can store the last known tabid from informix.systables, but it needs to be ready to find a new tabid value on occasion. You should probably only mark data in your dictionary for logical deletion.
To some extent, this assumes you can create a database in which to record this information. If you can't create an Informix database, you may have to use something else (MySQL, or SQLite, perhaps) to store the data dictionary. Alternatively, go to your DBA team and ask them for the information. Unless you're trying something self-evidently untoward, they're likely to help (but politics can get in the way — I've no idea how collegial your teams are).

How to speed up Redshift queries

I am using the json_extract_path_text function to extract values from JSON. As the row count grows, the query takes a long time to run and sometimes fails.
Is there a way to reduce the query execution time or to improve the json_extract_path_text function?
The solution is: store your data in tabular format instead of JSON. JSON is not a good choice for storing larger data sets because, by storing disparate data in a single column, JSON does not leverage Amazon Redshift's column-store architecture. Alternatively, change your node type to a bigger one.
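As an illustration of the "tabular instead of JSON" advice, one approach (a sketch only; the file name and fields are invented, and it assumes newline-delimited JSON) is to flatten the documents into columns before loading, so each field gets its own Redshift column instead of living inside one big VARCHAR:

import json
import pandas as pd

# Each line of raw_events.json is assumed to be one JSON document.
with open("raw_events.json") as f:
    records = [json.loads(line) for line in f if line.strip()]

flat = pd.json_normalize(records)              # nested keys become columns such as "user.id"
flat.to_csv("events_tabular.csv", index=False)
# COPY events_tabular.csv into a table with one column per field,
# rather than storing the whole JSON blob in a single column.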
Since Redshift is a columnar store, storing data in JSON format will not speed up queries on it; that would suit a document-model NoSQL database, but not Redshift. To make Redshift queries efficient, the distribution style of your tables matters (even when the data does not follow a specific order or is random), based on the number of nodes in your cluster. Also, a distribution key on the primary key column (in an otherwise relational model) and a sort key on the same column will help with joins, since Redshift can then use a sort-merge join instead of the slower hash join.
For more details about this, do have a look at the documentation. RTFM is your friend here.
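For the distribution-key and sort-key point, the table definition might look roughly like this (a sketch with invented table and column names, sent through psycopg2; choose the keys to match your real join and filter columns):

import psycopg2

ddl = """
CREATE TABLE events (
    user_id    BIGINT,
    event_type VARCHAR(64),
    event_ts   TIMESTAMP
)
DISTSTYLE KEY
DISTKEY (user_id)              -- rows with the same user_id land on the same node
SORTKEY (user_id, event_ts);   -- sorted storage enables merge joins and block skipping
"""

# Credentials and host are placeholders.
with psycopg2.connect("dbname=analytics host=my-cluster user=admin") as conn:
    with conn.cursor() as cur:
        cur.execute(ddl)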

How to improve the performance in big table join?

Please help me out with this big data problem.
I have a very large table (500 GB) that stores cookie information collected from one website, and I try to provide services to many other clients. Each client has its own cookies, so in the end I need to query 500 GB + 300 GB (client data).
Since some queries use both my cookie data and the client's cookie data, I may need to do a join between my table and their table, and the performance of that join is bad. To solve this problem, I put the entire 800 GB of data into one giant table. Since there is no join, the performance is good. But when I expand my service to multiple clients, it takes too much storage.
Currently I am using Vertica as my data source, and I use bitmaps to store my information.
Any suggestions that maintain my current performance but also support around 40 clients? My storage is about 12 TB, and each client in the current solution takes 1.5 TB.
What I want is either a replacement for Vertica that can support bitmap operations and quick table joins, or a better way to represent my data.
My storage is about 12 TB, and each client in the current solution takes 1.5 TB.
If you have 40 * 1.5TB worth of non-duplicated cookie data to store, there's no magic to make that fit into 12TB.
This will be an imprecise answer due to the lack of details about definitions, etc. But I would add the following about performance:
Look at your projection definitions. You may be able to get performance gains depending on what you put in the order by clause of the projection.
You have a few ways forward, depending on the specifics of your case. Points 1 and 3 are the easiest to deal with:
You can properly set projections to make sure that both tables are identically segmented (see the sketch after this list): https://my.vertica.com/docs/6.1.x/HTML/index.htm#12549.htm
You can set up pre-join projections, where the join cost is paid during data load rather than during data retrieval; see https://my.vertica.com/docs/6.1.x/HTML/index.htm#1299.htm
Make sure that your data types are the best possible. Matching on ints is faster than matching on strings, and matching columns with low cardinality is faster than matching columns with high cardinality.
If 1 and 3 are set up well, Vertica can actually apply filters before decompression, which speeds up your query considerably and uses a lot less memory.
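To make point 1 concrete, a sketch of identically segmented projections might look like the following. The table, column, and projection names are invented, and the exact projection design should follow the linked documentation; the statements are sent here through the vertica_python client:

import vertica_python

conn_info = {"host": "vertica-host", "port": 5433,
             "user": "dbadmin", "password": "***", "database": "cookies"}

# Segment both tables identically on the join key so the join can be resolved
# locally on each node instead of reshuffling data across the cluster.
projections = [
    """CREATE PROJECTION my_cookies_p AS
           SELECT cookie_id, visit_ts, flags FROM my_cookies
           ORDER BY cookie_id
           SEGMENTED BY HASH(cookie_id) ALL NODES""",
    """CREATE PROJECTION client_cookies_p AS
           SELECT cookie_id, client_id, attrs FROM client_cookies
           ORDER BY cookie_id
           SEGMENTED BY HASH(cookie_id) ALL NODES""",
]

with vertica_python.connect(**conn_info) as conn:
    cur = conn.cursor()
    for ddl in projections:
        cur.execute(ddl)
    # New projections need a refresh (e.g. SELECT START_REFRESH();) before they are used.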

SSIS Merge Join component wrote 0 rows

First of all, thanks to the community for the amount of information on the site; it has helped me a lot with C# and SSIS. Second, I'm not very good with English, so please be patient; if you don't understand something, please ask and I'll try to explain it better.
I have two OLE DB connection sources from different databases; both tables have a column with an ID that I use as a join key. In RUT CRUZADOS, the ID is a float datatype, while in the other source (CTACTE AÑO PAS) I don't know which data type it is (I can't open the database with SQL Server; I can only run SELECT statements).
When I combine them in the Merge Join, it doesn't report any error, but when I run the package, this happens:
[SSIS.Pipeline] Information: "component "CARGOS ABONOS" (239)" wrote 0 rows.
In Microsoft Access, the same inner join returns about 4 million rows. I think the problem is the metadata, but I don't know how to use the Data Conversion component. Can someone help me, please?
Thank you all
You can view the data types, at least as far as SSIS is concerned by double clicking on connector lines. In the Data Flow Path Editor that pops up, the Metadata tab will describe the column types.
That said, it doesn't matter because the Merge Join transformation is only going to allow you to merge data of the same type.
A Merge Join requires the source system data to be sorted. This is accomplished by either adding sort components into the stream (not recommended as this is an asynchronous transform that eats all your memory and kills your performance) or by explicitly sorting in your source systems and then marking them as sorted in the Advanced tab.
Since I don't see a Sort, that leads me to believe the sort is done in the source systems. Or the sorts are not done there, but someone has marked the output as sorted. There must be explicit ORDER BY clauses in those source queries. Sometimes SQL Server will return data in the same order, but unless there is an ORDER BY, it cannot be guaranteed. (I wish I could use the flash tag to emphasize that last point.)
Future readers, if you have a sort in both systems and they are both sorted on the same column, then you need to examine collations. Case insensitive is a different beast than case sensitive, and a sort on an ASCII-based system yields a different order than one using EBCDIC for mixed alphanumeric data, like I once had...
Since the source data type appears to be float, sorting is probably not the culprit. Instead of a sort issue, you likely have an uglier and more insidious comparison issue: floating-point numbers are approximations. 1 = 1, but 1.00000000000(etc) may or may not be equal to 1.0000000000(etc)1.
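The approximation point is easy to see in any language; a quick Python illustration of how float equality silently fails:

# Values that look equal are not necessarily bit-for-bit equal.
print(0.1 + 0.2 == 0.3)   # False
print(0.1 + 0.2)          # 0.30000000000000004

# IDs that arrive through two different float conversions can differ in the last bits,
# so an equality join on them quietly matches zero rows.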
Do you actually need the decimal places to make the match? If not, casting to an integer in both systems (and sorting on the cast value) should make these matches work. If there are decimal places that matter, then you're going to need to cast to an exact numeric type (and pray that both systems convert in the same way). The fact that Access does it leads me to believe the integer data type will be your salvation.

Using multiple key value stores

I am using Ruby on Rails and have a situation where I am wondering whether some sort of key-value store would be more appropriate than MySQL. I have users that have_many lists, and each list has_many words. Some lists have hundreds of words, and I want users to be able to copy a list. This is a heavy MySQL task because it has to create those hundreds of word objects at one time.
As an alternative, I am considering using some sort of key-value store where the key would just be the word. A list of words could be stored in a text field in MySQL, or each list could be a new key-value DB. It seems like it would be faster to copy a key-value DB this way rather than going through the relational database, and it also seems like this might be faster in general. Thoughts?
The general way to solve this with a relational database would be to have a list table, a word table, and a list-words join table relating the two. You are correct that there would be some overhead, but don't overestimate it; because the table structure is defined up front, there is very little actual storage overhead per record, and records can be inserted very quickly.
If you want very fast copies, you could make lists copy-on-write, meaning a single list could be referred to by multiple users, or multiple times by the same user. You only actually duplicate the list when a user tries to add, remove, or change an entry. Of course, this is premature optimization; start simple and only add complications like this if you find they are necessary.
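For what the copy-on-write idea looks like in code, here is a minimal sketch in Python rather than Rails (the class and method names are made up); a copy only shares a reference, and a real duplicate is made the first time someone edits it:

class WordList:
    """Copy-on-write word list: copies share storage until one side mutates."""

    def __init__(self, words=()):
        self._words = list(words)
        self._shared = False

    def copy(self):
        clone = WordList.__new__(WordList)   # skip __init__ and share storage directly
        clone._words = self._words           # O(1): no words are duplicated yet
        clone._shared = self._shared = True
        return clone

    def _ensure_private(self):
        if self._shared:                     # the first write pays the real copy cost
            self._words = list(self._words)
            self._shared = False

    def add(self, word):
        self._ensure_private()
        self._words.append(word)

    def words(self):
        return list(self._words)

original = WordList(["apple", "pear"])
duplicate = original.copy()                  # instant, nothing copied yet
duplicate.add("plum")                        # duplicate now gets its own storage
print(original.words())                      # ['apple', 'pear']
print(duplicate.words())                     # ['apple', 'pear', 'plum']

In the database version, the copy would just insert a new list row pointing at the same word rows, and "going private" would mean duplicating the join-table rows only when the copied list is first edited.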
You could use a key-value store as you suggest. I would avoid trying to build one on top of a MySQL text field unless you have a very good reason, since it will make any sort of searching by key very slow, as it would require string searching. A key-value data store like CouchDB or Tokyo Cabinet could do this very well, but it would most likely take up more space (as each record has to have its own structure defined, and each word has to be recorded separately in each list). The only dimension of performance I would expect to be better is if you need massively scalable reads and writes, but that's only relevant for the largest of systems.
I would use MySQL naively, and only make changes such as this if you need the performance and can prove that this method will actually be faster.
