CSV data analysis with hidden variable names
I have a data analysis problem.
For an interview, the employer gave me a sample of data (a CSV file) with no indication of the variable names. There is only numeric data. He told me to analyze the data and find the relationships between the variables. This is the first time I have worked with data without variable names. Can you help me find the best way to analyze it?
Here are the first 16 lines of the data.
Thank you very much.
Best!
"";"A";"B";"C";"D";"E";"G";"H";"I";"J"
"1";0,448161832988262;2;2;114,646534434721;3,30110318594957;-19,599488176302;234,434198671982;20,6532414748726;15,1932268216357
"2";0,432204742450267;1;2;85,7605207392913;3,62111444347777;-1,15239499472225;178,211501645965;16,9600753825141;-0,112499563201146
"3";0,568398675648496;1;2;71,6164756272189;7,69368362802501;24,7537090689179;153,913075791719;28,2888808213224;-0,104468251047106
"4";0,349178678821772;2;2;121,170146136389;3,37847601607964;11,7925265579597;252,551708711339;18,621791522593;6,38200554369286
"5";0,537820372032002;2;1;63,9725158637857;8,09451459803737;45,2679414663975;142,05393538719;22,3210943859884;2,63072085022351
"6";0,311805972829461;2;1;105,86608851945;7,00105784025606;56,7118971691138;207,671845532346;28,1108882120601;6,91708052471079
"7";0,188992844894528;2;1;91,4349746370189;4,22115610670039;113,719882708266;191,935903422313;27,722226329035;1,96456921716711
"8";0,220268326113001;2;1;96,9513989045572;3,30097548353886;41,9359256486865;193,983671256507;21,1632614867063;0,897974894012737
"9";0,745850879931822;2;2;117,920731392395;0,708984633432214;12,0311471051253;252,479427305587;16,1589645742755;13,4663055003645
"10";0,685712045058608;2;1;114,761284685084;-2,05310178438932;78,7495516682929;241,262975712287;29,0455805154816;0,628022028133988
"11";0,288209964288399;2;1;72,581830232794;6,64981545368408;45,2083742548179;148,12776449076;27,9515510556148;7,63068446314225
"12";0,837000070838258;2;1;92,5181661942909;0,0636188206698255;41,882160754504;199,975615311369;13,5301385749224;2,80857064980119
"13";0,725813006050885;2;2;119,512257003768;-1,14732052387933;13,9914669274313;243,515192596543;13,7178765507419;10,7559454559754
"14";0,448982792440802;2;1;76,9548634545701;6,22868004513269;36,4101586275915;159,127996500245;25,6106129292056;20,5549219150393
"15";0,000250126933678985;2;2;78,7042365624531;6,87908300161039;33,3968162297511;155,114586653932;25,8772230949845;2,66564733215586
"16";0,430642496794462;2;1;60,8333601287592;6,01562443738694;38,8547948159146;122,277246640109;22,1027759089393;4,64118405160815
The first thing I would do is plot the data as a line chart, with the record index (the first column in your data) along the X axis and each column plotted as a separate line. That will give you a first idea of the "behavior" of each column, and hopefully also show some relations between the columns.
The next steps depend on what you can figure out from this first glance.
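Before anything can be plotted, the file has to be read with ";" as the delimiter and "," as the decimal mark. Here is a minimal sketch using only the Python standard library, on columns A, D and H of the first four rows from the question. The Pearson correlation is an addition here, a numeric complement to the plot rather than something the answer above prescribes. The names load_columns and pearson are made up for illustration.

```python
import csv
import io
import math

# Columns A, D and H of the first four rows from the question (a subset, to keep this short).
SAMPLE = '''"";"A";"D";"H"
"1";0,448161832988262;114,646534434721;234,434198671982
"2";0,432204742450267;85,7605207392913;178,211501645965
"3";0,568398675648496;71,6164756272189;153,913075791719
"4";0,349178678821772;121,170146136389;252,551708711339
'''


def load_columns(text):
    """Parse semicolon-delimited text with decimal commas into {column name: [floats]}."""
    reader = csv.reader(io.StringIO(text), delimiter=";")
    header = next(reader)[1:]  # drop the unnamed index column
    columns = {name: [] for name in header}
    for row in reader:
        if not row:
            continue
        for name, cell in zip(header, row[1:]):
            columns[name].append(float(cell.replace(",", ".")))
    return columns


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


columns = load_columns(SAMPLE)
r_dh = pearson(columns["D"], columns["H"])
print(f"correlation between D and H: {r_dh:.3f}")
```

On these four rows D and H move together very closely, which is the kind of relation the line chart should also make visible. Four rows are far too few to conclude anything; the same functions can be pointed at the full file.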
Related
Creating Dynamic Sheet Cell Reference List for pulling numbers to SUM
I've been building a data analysis sheet, which is quite verbose at the moment and a bit more complicated than it should be, as I've been figuring things out along the way. Please note, I work with student data in a school.
Basically, I have two sets of input data:
1) Data imported from a CSV file that includes test data and codes for Common Core Standards, and the questions tied to those standards, as a whole-class summary.
2) Data imported from a CSV file that includes individual scores by question.
I am looking to construct 2 views:
1) A view that collates and displays data of individual standards per student, with a drop-down to change the standard, allowing a teacher to see class performance by standard in a broad view. The drop-down is populated dynamically from the input data (so staff could eventually dump data and go directly to reports).
2) A view that collates and displays data of individual students broken down by performance on each standard, allowing a teacher to see the broader spectrum for each student. The student drop-down is populated from source list 2.
I have been able to build the first view, but am struggling with the second. I've been able to separate the question codes and develop strings of cell references to the scoring data, including a dynamic reference to the row where the selected student's score data appears in the second source set above. I tried to pass an indirect() formula into a sum() so as to compute a mean, and have encountered errors. I think SUM() doesn't process comma-separated cell reference lists from Indirect() [or in general], or there is something I am missing to help parse it.
Here is the formula I have tried: =Sum(vlookup(D7,CCCodeManip!$A:$C,3,false))
CCCodeManip!C:C contains the generated text (based on the dynamic standards, question codes, etc.). Here's an example of what would be found there: 'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
I need these to be dynamic so that teachers can input different sets of standards, questions, and student data and the sheet automatically collates and reports them in uniform ways (with an upper bound of 20 standards as I currently have it built).
Here is a link to the sheet I built, with names and IDs anonymized. There's a crap ton of sub-tabs; most of them exist so I can split apart and re-combine data neatly without things erroring out due to overlapping data, and a few are different attempts at parsing the cell reference strings. The first two tabs are the current status of the data views. I plan to hide a bunch of the functional stuff that is there to help pull data accurately. The 3rd and 4th tabs are the source data sets. The 5th is a modified version of the source data that allows me to reference things better, and I've tried to arrange the most relevant sheets towards the front of the set. https://docs.google.com/spreadsheets/d/1fR_2n60lenxkvjZSzp2VDGyTUO6l-3wzwaV4P-IQ_5Y/edit?usp=sharing
Does anyone have a different approach? I am aware that I might be as far as I can go with this and perhaps should consider scripts. My coding experience is a bit out of date and my strength is more with formulas, but I can dig into things with some direction, if anyone can help.
OK, so I noticed something. It seems the failure is in the indirect reference: =indirect(CCCodeManip!C3)
The string I am trying to parse via indirect is generated dynamically from references to other data, and looks something like this: 'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17, 'M-ADI'!P17, 'M-ADI'!Q17, 'M-ADI'!R17, 'M-ADI'!J17
Indirect returns a #REF error saying that the above string is not a cell reference. Can someone give me a clue as to what is causing this? I am going to dig into Google's docs on Indirect() and will post anything that I find. Perhaps indirect() can't handle lists, only specific references and arrays, which may require me to build a sheet that does the SUM formula for each question set (?)
So I think I figured it out, but I ended up parsing the data differently: I do the sum based on individual cell references and a separate SUM formula, bypassing the need to do it all at once. It just makes my sheets a lot dirtier! I am eventually going to see if code could do it better if I need to, but this is closed for now. Basically, I used individual cell references to recall the scores in a row, then used a separate SUM formula, and created references / structures to be able to pull those sum() results. It achieves the same end, but with extra crap on the sheet.
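The workaround above amounts to resolving each reference on its own and summing afterwards, since the comma-separated text is not one valid reference. A rough Python model of that logic, not Sheets code: the workbook dictionary, its scores, and the function names are all hypothetical stand-ins for the real tabs.

```python
def parse_reference_list(text):
    """Split "'M-ADI'!M17, 'M-ADI'!N17" into [("M-ADI", "M17"), ("M-ADI", "N17")]."""
    refs = []
    for part in text.split(","):
        sheet, cell = part.strip().rsplit("!", 1)
        refs.append((sheet.strip("'"), cell))
    return refs


def sum_references(text, workbook):
    """Look up every reference individually, then add the values."""
    return sum(workbook[sheet][cell] for sheet, cell in parse_reference_list(text))


# Hypothetical scores standing in for row 17 of the 'M-ADI' tab.
workbook = {"M-ADI": {"M17": 1, "N17": 0, "O17": 1}}
total = sum_references("'M-ADI'!M17, 'M-ADI'!N17, 'M-ADI'!O17", workbook)
print(total)
```

If the sheet ever moves to a script, the same split-then-look-up structure would apply; only the lookup step would change to read real cells.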
Empty Row Labels in PivotTable after adding descriptions as a measure
I'm trying to use Powerpivot to help me summarize Actual and Committed Costs from a project. After some manipulations in Power Query I managed to get the right structure, but getting it into a Powerpivot table is still a disaster. As I also want to include Descriptions but keep the nice indented Powerpivot layout, I used a formula to convert the description to a measure using DAX. I found a formula online and tweaked it a bit so it would also work when there was more than 1 hit:
=IF(COUNTROWS(VALUES('Table1'[Description]))>1,FIRSTNONBLANK('Table1'[Description],0),VALUES('Table1'[Description]))
This gave a more satisfying result. However, there were two setbacks:
1) There are a lot of copies of the same descriptions, creating empty Row Labels (I don't want any empty Row Labels).
2) When taking the description from the source table, it doesn't take the result from the first match for the row label in my table.
How can I solve this problem?
elasticsearch - time series - search time intersection between two indexes
I have two indexes. Each one of them has a time and a value (the data structure is in a linked image). In that example I would like to find a specific time where Index2.val-Index1.val>70. Note that a value stays the same until the next entry: if a value is set to 20 on 1-1-14, it will still be 20 on 2-1-14 if no entry exists for that day. One solution is to fetch both vectors and compare them with a linear algorithm, but I suppose the performance will be bad. Is there an out-of-the-box solution for that? Thanks, David
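For reference, the client-side "linear algorithm" mentioned in the question can be sketched as a single merge pass over both series. This is not an Elasticsearch query. It assumes each index has already been fetched as a list of (time, value) pairs sorted by time, that times are comparable (plain integers here), and that a value holds until the next entry. The function name and the sample data are hypothetical.

```python
def first_time_diff_exceeds(series1, series2, threshold):
    """Earliest time at which series2 - series1 > threshold, or None if it never happens.

    Both inputs are lists of (time, value) pairs sorted by time; each value is
    carried forward until the next entry in the same series.
    """
    i = j = 0
    v1 = v2 = None  # last known value of each series
    while i < len(series1) or j < len(series2):
        t1 = series1[i][0] if i < len(series1) else None
        t2 = series2[j][0] if j < len(series2) else None
        # Advance to the earliest pending timestamp across both series.
        if t2 is None or (t1 is not None and t1 <= t2):
            t = t1
        else:
            t = t2
        if t1 == t:
            v1 = series1[i][1]
            i += 1
        if t2 == t:
            v2 = series2[j][1]
            j += 1
        # Compare only once both series have a known value.
        if v1 is not None and v2 is not None and v2 - v1 > threshold:
            return t
    return None


index1 = [(1, 20), (3, 10)]  # hypothetical entries: (day, value)
index2 = [(2, 50), (4, 85)]
print(first_time_diff_exceeds(index1, index2, 70))
```

The pass touches each entry once, so the cost is proportional to the total number of entries; whether that is acceptable depends on how much data has to be pulled out of the cluster first.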
Google Spreadsheet: a column mirrors a different column's values when it shouldn't
I've been working on a spreadsheet to examine something. I was converting a recursive formula into a non-recursive (closed-form) one. The initial value, or first state, was 60. The second state is 240: you take the previous state, double it, and add 120. That leads to the 3rd and 4th states of 600 and 1320. This base formula was clear enough that (60+120)*2^(n-1)-120 accurately expresses it.
My second part comes from needing to add the ability to decrease the costs while still staying true to the state. The formula above only works when the cost reduction is 0. After considerable effort (I kept having minor rounding errors) I arrived at (ROUND(60-60*0.015*B$2)+(120-round(120*rounddown(B$2/3)*0.03,0)))*2^($A2-1)-(120-round(120*rounddown(B$2/3)*0.03,0)).
To test the formulas I created a table with a2=1 to a5=4, and b2=0 to h2=6. I was using Google Spreadsheets to examine the information. When I populated the table I found that all the values were correct with the formula, except in column G. In G the values are identical to F. To try to correct this I have deleted the information from the cells, deleted the columns, and even tried again in a new spreadsheet. But in all cases G=F when it should not. I can't figure out why I'm getting a duplicate column. The information on row 3 is the values that it should be using. The expected values are G4=55, G5=226, G6=568, G7=1252.
In case anyone wanted to know, I finally managed to solve the issue. I needed to round in one more place. The following is the formula that has worked for my current testing. sum((ROUND(60-round(60*0.015*A$2))+(120-round(120*rounddown(A$2/3)*0.03,0)))*2^($A25-1)-(120-round(120*rounddown(A$2/3)*0.03,0)))
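One plausible reading of why G matched F, sketched in Python. With a reduction of 5, 60-60*0.015*5 is 55.5, which rounds to 56, the same base that a reduction of 4 gives (56.4 rounds to 56). Rounding the reduction first gives 60-5 = 55, which matches the expected G values. Decimal arithmetic is used so that halves round up, as the results in the question suggest ROUND does. The function names are made up, and mapping columns F and G to reductions 4 and 5 is an assumption based on b2=0 to h2=6.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP


def round_half_up(x):
    """Round to the nearest integer, with halves going up (for the positive values used here)."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_HALF_UP))


def round_down(x):
    """Drop the fractional part, standing in for ROUNDDOWN with no digits argument."""
    return int(Decimal(x).quantize(Decimal("1"), rounding=ROUND_DOWN))


def step(reduction):
    # 120-round(120*rounddown(r/3)*0.03,0)
    return 120 - round_half_up(120 * round_down(Decimal(reduction) / 3) * Decimal("0.03"))


def original_cost(n, reduction):
    # ROUND(60-60*0.015*r): the difference is rounded as a whole.
    base = round_half_up(60 - 60 * Decimal("0.015") * reduction)
    return (base + step(reduction)) * 2 ** (n - 1) - step(reduction)


def fixed_cost(n, reduction):
    # ROUND(60-round(60*0.015*r)): the reduction is rounded before subtracting.
    base = 60 - round_half_up(60 * Decimal("0.015") * reduction)
    return (base + step(reduction)) * 2 ** (n - 1) - step(reduction)


print([original_cost(n, 4) for n in range(1, 5)])  # reduction 4
print([original_cost(n, 5) for n in range(1, 5)])  # reduction 5, original formula
print([fixed_cost(n, 5) for n in range(1, 5)])     # reduction 5, extra rounding step
```

With a reduction of 0 both functions reduce to (60+120)*2^(n-1)-120, so the base sequence 60, 240, 600, 1320 is unchanged by the fix.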
SPSS Frequency Plot Complication
I am having a hard time generating precisely the frequency table I am looking for using SPSS. The data in question: cases (n = ~800) with categorical variables DX_n (n = 1-15), each containing ICD9 codes, many of which are the same code. I would like to create a frequency table that groups the DX_n variables such that I can view frequency of every diagnosis in this sample of cases. The next step is to test the hypothesis that the clustering of diagnoses in this sample is different than that of another. If you have any advice as to how to test this, that would be really appreciated as well! Thanks! Edit: My attempts: 1) Analyze -> Descriptive Statistics -> Frequencies; then add variables DX_n (1-15) and display frequency charts. The output is frequencies of each ICD9 code per DX_n variable (so 15 tables are generated - I'm hoping to just have one grouped table). 2) I tried adjusting the output format to organize by variable and also to compare variables but neither option gives the output I'm looking for.
I think what you are looking for is CTABLES. It can do parallel columns of frequencies, and it includes a column proportions test that can show whether the distributions differ.
Thank you, JKP! You set me on exactly the right track. I'm not sure how I overlooked that menu. Just to clarify in case anyone else comes along needing to figure this out:
1) Group diagnosis variables into a multiple response set using Analyze > Custom Tables > Multiple Response Sets. Code the variables as categories. http:// i.imgur.com/ipE9suf.png
2) Create a custom table with your new multiple response set as a row and the subsets to compare as columns. I set summary statistics to compute from rows and added the column n% column (sorted descending). http:// i.imgur.com/hptIkfh.png
3) Under test statistics, include a column proportions z-test as JKP suggested. http:// i.imgur.com/LYI6ZRl.png
4) Behold, your results: http:// i.imgur.com/LgkBA8X.png
Thanks again, and best of luck to anyone else who runs across this. -GCH
p.s. Sorry everyone, I was going to post images but don't have enough reputation points yet. Images detailing the steps in the GUI can be found at the obfuscated links above.
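For anyone who wants to sanity-check the grouped table outside SPSS, the counting behind a multiple response set can be sketched in a few lines of Python. This is an illustration of the idea, not SPSS syntax. The three cases and their codes are invented, only three of the fifteen DX_n variables are shown, and a code repeated within one case is counted once, which is an assumption about how duplicates should be treated.

```python
from collections import Counter

# Invented cases; None marks an empty diagnosis slot.
cases = [
    {"DX_1": "250.00", "DX_2": "401.9", "DX_3": None},
    {"DX_1": "401.9", "DX_2": None, "DX_3": None},
    {"DX_1": "428.0", "DX_2": "250.00", "DX_3": "401.9"},
]


def diagnosis_frequencies(cases):
    """Count, for each code, the number of cases that carry it in any DX_n variable."""
    counts = Counter()
    for case in cases:
        codes = {code for code in case.values() if code}  # one vote per case per code
        counts.update(codes)
    return counts


freq = diagnosis_frequencies(cases)
for code, n in freq.most_common():
    print(f"{code}: {n} cases ({100 * n / len(cases):.0f}%)")
```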