Can a p-value tell which variable is the major factor in the outcome? - p-value

I am having trouble with one of the tasks for this data set from Kaggle: finding the major factor in test outcomes. The factors are all categorical.
Can the p-value from stats.chi2_contingency() tell which variable has the greatest impact on each test outcome? (I am assuming that a lower p-value indicates a stronger relationship between the two variables, factor and test outcome.)
Is this the right method to use? Thanks!
import numpy as np
import pandas as pd
from scipy import stats

# df is the Kaggle DataFrame; its first five columns are the categorical factors.
sub = ['math', 'reading', 'writing']
factors = list(df.columns[:5])
for x in sub:
    # 1 = good score (60 or above), 0 = otherwise
    subject = np.where(df[f'{x} score'] >= 60, 1, 0)
    for y in factors:
        observed = pd.crosstab(index=df[y], columns=subject)
        p_value = stats.chi2_contingency(observed=observed)[1]
        print(f'Subject: {x}, Factor: {y}, P-value: {p_value}')
    print('===========')
result:
Subject: math, Factor: gender, P-value: 4.327453315859715e-07
Subject: math, Factor: race/ethnicity, P-value: 8.606440583769172e-09
Subject: math, Factor: parental level of education, P-value: 1.3038966941106568e-06
Subject: math, Factor: lunch, P-value: 7.991906038402147e-23
Subject: math, Factor: test preparation course, P-value: 0.03659940205943433
===========
Subject: reading, Factor: gender, P-value: 0.00021136247670793947
Subject: reading, Factor: race/ethnicity, P-value: 0.0013115634444564267
Subject: reading, Factor: parental level of education, P-value: 0.00040600676669462395
Subject: reading, Factor: lunch, P-value: 8.774296142015978e-11
Subject: reading, Factor: test preparation course, P-value: 3.060148315231233e-10
===========
Subject: writing, Factor: gender, P-value: 6.85909966324548e-08
Subject: writing, Factor: race/ethnicity, P-value: 0.00117283769110293
Subject: writing, Factor: parental level of education, P-value: 8.278387435600882e-05
Subject: writing, Factor: lunch, P-value: 1.5790148042752678e-16
Subject: writing, Factor: test preparation course, P-value: 4.553365478101716e-15
===========
right way to test which variable has the most impact on outcome.
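A p-value reflects sample size and the number of categories as well as the strength of the association, so ranking factors by p-value alone can mislead. One common complement is an effect size such as Cramér's V. The following is a minimal sketch, not a definitive answer: cramers_v is a helper name introduced here, the counts are made up rather than taken from the Kaggle data, and no continuity correction is applied (the chi-square statistic it computes should match the first value returned by stats.chi2_contingency(observed, correction=False)).

```python
import math

def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    # Pearson chi-square statistic, without continuity correction.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(row_totals), len(col_totals)) - 1
    return math.sqrt(chi2 / (n * k))

# Made-up counts (rows = factor levels, columns = fail/pass), not the Kaggle data.
small = [[30, 70], [50, 50]]
large = [[300, 700], [500, 500]]  # same proportions, ten times the sample

print(cramers_v(small), cramers_v(large))
```

The two tables have identical proportions, so V is the same for both, while the chi-square statistic (and hence the p-value) changes with the sample size. Comparing V across factors for a given subject is one way to compare strength of association on a common 0-to-1 scale.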

Related

Coefficients and Confidence Intervals - GLM Binomial (Logit)

I've run an Interrupted Time Series Analysis using a Binomial logistic regression in R.
glm(`Subject Refused Ratio` ~ Quarter + int2 + time_since_intervention2 , df, family = "binomial"(link='logit'), weights = sub_weight)
I want to derive the coefficients and confidence intervals for each of my outcomes and am currently doing so with the margins package, with the following outcome:
summary(margins(rrfit1a))
                   factor     AME     SE       z      p   lower  upper
                     int2  0.0963 0.1064  0.9050 0.3654 -0.1122 0.3047
                  Quarter -0.0006 0.0049 -0.1162 0.9075 -0.0101 0.0089
 time_since_intervention2 -0.0056 0.0209 -0.2695 0.7875 -0.0466 0.0353
These seem largely consistent with the modelled data. For example, they suggest the effect of the intervention (int2) could range between a 0.11 decrease and a 0.30 increase.
However, I really need to add similar coefficient values and confidence intervals for the original Intercept. I have tried to do so using a simple exp(coefficients) and the confint function from the MASS package, but the outcome doesn't quite tie in with what I would anticipate seeing.
exp(coefficients(rrfit1a))
             (Intercept)                  Quarter                     int2 time_since_intervention2
               0.9093160                0.9977377                1.4720697                0.9776187
For context, the fitted value of the model at the first observation is around 0.47, which looks correct. So I wonder whether I am simply misinterpreting the above, or whether there is something more fundamentally wrong with it.
Secondly, the confint outcome is:
> confint(rrfit1a, level = 0.90)
Waiting for profiling to be done...
                                 5 %       95 %
(Intercept)              -0.38990085 0.19896064
Quarter                  -0.03437363 0.02981353
int2                     -0.31682909 1.09669529
time_since_intervention2 -0.16144941 0.11569710
This isn't what we'd expect to see, and it looks nothing like our plotted confidence intervals.
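One possible reading, offered as a sketch rather than a diagnosis: in a logit model, exp(coefficient) is on the odds scale and confint() reports limits on the log-odds scale, whereas the fitted values and the margins output are on the probability scale. Converting the intercept gives the modelled probability when all predictors are zero. The helper names below are introduced here for illustration, and the numbers are copied from the output above. In R, plogis() performs the same inverse-logit conversion.

```python
import math

def odds_to_probability(odds):
    """Convert odds to a probability: p = odds / (1 + odds)."""
    return odds / (1.0 + odds)

def inv_logit(x):
    """Convert a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# Values copied from the question's output.
intercept_odds = 0.9093160             # exp(coefficients(rrfit1a)), (Intercept)
ci_logit = (-0.38990085, 0.19896064)   # confint(rrfit1a, level = 0.90), (Intercept) row

intercept_probability = odds_to_probability(intercept_odds)
ci_probability = tuple(inv_logit(bound) for bound in ci_logit)

print(round(intercept_probability, 3))
print([round(p, 3) for p in ci_probability])
```

On this reading, an odds value of about 0.91 corresponds to a probability a little under 0.5, which is in line with the fitted value of around 0.47 mentioned above, assuming the predictors are close to zero at the first observation.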

Parsing author name, title, journal from unstructured references

I have a list of references. I'm trying to parse the author name, title, journal name, volume number, etc. The reference entries are not uniform: some contain only a title and multiple author names, some contain only a title, and so on. How do I go about parsing this and storing the information in the relevant columns? A few examples of the reference entries are shown below.
Neufeld et al., “Vascular endothelial growth factor (VEGF) and its receptors,”The FASEB Journal, vol. 13, pp. 9-22 (1999).
PCT “International Search Report and Written Opinion” for International Application No. PCT/US08/60740, mailed , Aug. 19, 2008; 7 pages.
Wirth, et al. Interactions between DNA and mono-, bis-, tris-, tetrakis-, and hexakis(aminoacridines). A linear and circular dichroism, electric orientation relaxation, viscometry, and equilibrium study. J. Am. Chem. Soc. 1988; 110 (3):932-939.
Buadu LD, Murakami, J, Murayama S., et al., “Breast Lesions: Correlation of Contrast Medium Enhancement Patterns on MR Images with Histophathological Findings and Tumor Angiogenesis,” Radiology 1996, 200:639-649.
Bers “Excitation Contraction Coupling and Cardiac Contractile Force”, Internal Medicine, 237(2): 17, 1991, Abstract.
Abella, J., Vera, X., Gonzalez, A., “Penelope: The NBTI-Aware Processor”, MICRO 2007, pp. 85-96.
JP Office Action dtd Dec. 2, 2010, JP Appln. 2008-273888, partial English translation.
Maruyama, H., et al., “Id-1 and Id-2 are Overexpressed in Pancreatic Cancer and in Dysplastic Lesions in Chronic Pancreatitis,”American Journal of Pathology?155(3):815-822 (1999).
Attachment 2, High Speed Data RLP Lucent Technologies, Version 0.1, Jan. 16, 1997.
Diddens, Heyke et al. “Patterns of Cross-Resistance to the Antigolate Drugs Trimetrexate, Metoprine, Homofolate, and CB3717 in Human Lymphoma and Osteosarcoma Cells Resistant to Methotrexate.” Cancer Research, Nov. 1983, pp. 5286-5292, vol. 43.
Installation drawings having drawing No. 1069965, dated Aug. 14, 1999 (3 pages).
Means et al., Chemical modifications of proteins: history and applications, Bioconjugate Chem., 1:2-12 (1990).
Bock, “Natural History of Severe Reactions to Foods in Young Children,”J. Pediatr. 107: 676-680, 1985.
Chankhunthod, Anawat, et al., “A Hierarachical Internet Object Cache,” in Proceedings of the USENIX 1996 Annual Technical Conference; San Diego, CA., (Jan. 1996), pp. 153-163.
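As a starting point for entries this irregular, one option is to pull out the fields that have a recognisable shape (year, page range, quoted title) with regular expressions and leave everything else empty. The sketch below is only that: parse_reference and the three patterns are names and heuristics introduced here, they assume titles are wrapped in straight or curly double quotes, and they will miss or mislabel some entries. Author and journal names are not attempted.

```python
import re

# Heuristic patterns; each is an assumption about how the field usually looks.
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")
PAGES_RE = re.compile(r"(?:pp\.\s*|:\s*)(\d+)\s*-\s*(\d+)")
TITLE_RE = re.compile(r"[\"“](.+?)[\"”]")

def parse_reference(ref):
    """Return a dict with the fields that could be found; missing ones are None."""
    year = YEAR_RE.search(ref)
    pages = PAGES_RE.search(ref)
    title = TITLE_RE.search(ref)
    return {
        "year": year.group(0) if year else None,
        "pages": f"{pages.group(1)}-{pages.group(2)}" if pages else None,
        "title": title.group(1).strip(" ,.") if title else None,
    }

examples = [
    'Neufeld et al., "Vascular endothelial growth factor (VEGF) and its receptors," '
    'The FASEB Journal, vol. 13, pp. 9-22 (1999).',
    'Means et al., Chemical modifications of proteins: history and applications, '
    'Bioconjugate Chem., 1:2-12 (1990).',
]

parsed = [parse_reference(ref) for ref in examples]
for record in parsed:
    print(record)
```

Each returned dict maps directly onto one row of a table with year, pages, and title columns. Entries that defeat the patterns simply yield None in the affected column, which makes them easy to find and handle separately.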

Biopython - Big Discrepancy Calculating RNA melting Temperature over Literature

I see big discrepancies when calculating the melting temperature of RNA 7-mers with Biopython, compared with values generated by a popular algorithm.
I tried the nearest-neighbour algorithm with the RNA and salt concentrations described in the respective paper (thermodynamic table as used in the paper below, from Freier et al. 1986). Yet the values differ considerably (execute the code below to see).
I tried all seven salt correction methods provided by Biopython, but I never get close to the values generated by the siRNA design algorithm for the same 7-mers.
Can someone tell me how accurate Biopython's nearest-neighbour melting temperature algorithm is, especially for short oligomers like my 7-mers? Is there something I am implementing wrong? Any suggestions?
Values derived from executing sample input:
http://sidirect2.rnai.jp/
Tm is given for the seed duplex of the guide strand: bases 2-7
Literature:
"Thermodynamic stability and Watson–Crick base pairing in the seed duplex are major determinants of the efficiency of the siRNA-based off-target effect"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602766/pdf/gkn902.pdf
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp

# (7-mer sequence, Tm reported by siDirect)
test_list = [
    ('GGAUUUG', 21.5),
    ('CUCAUUG', 18.1),
    ('CAUAUUC', 8.7),
    ('UUUGAGU', 19.2),
    ('UUUUGAG', 12.2),
    ('GUUUCAA', 14.9),
    ('AGUUUCG', 19.7),
    ('GAAGUUU', 13.3)
]
for seq, sidirect_tm in test_list:
    myseq = Seq(seq)
    # RNA_NN1 = Freier et al. (1986)
    tm = MeltingTemp.Tm_NN(myseq, dnac1=100, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7)
    tm = round(tm, 1)  # round to one decimal
    print('BioPython Tm: ' + str(tm) + ' siDirect Tm: ' + str(sidirect_tm))
I answered the question at biology.stackexchange and Biostars. In short: it seems that siDirect calculates the Tm incorrectly because it uses a 1000-fold higher primer concentration.
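To get a feel for how much a 1000-fold difference in strand concentration can matter, here is a rough sketch using the standard two-state expression Tm = ΔH / (ΔS + R ln(C/4)) for a non-self-complementary duplex with equal strand concentrations. This is not Biopython's code and not siDirect's; two_state_tm is a name introduced here, and the ΔH and ΔS values are hypothetical, chosen only to illustrate the size of the shift.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)

def two_state_tm(delta_h_kcal, delta_s_cal, total_strand_molar):
    """Tm in degrees C for a non-self-complementary duplex, equal strand concentrations."""
    kelvin = (1000.0 * delta_h_kcal) / (delta_s_cal + R * math.log(total_strand_molar / 4.0))
    return kelvin - 273.15

# Hypothetical duplex parameters, for illustration only.
DELTA_H = -50.0    # kcal/mol
DELTA_S = -140.0   # cal/(K*mol)

tm_100nM = two_state_tm(DELTA_H, DELTA_S, 100e-9)   # 100 nM total strand concentration
tm_100uM = two_state_tm(DELTA_H, DELTA_S, 100e-6)   # 1000-fold higher

print(round(tm_100nM, 1), round(tm_100uM, 1))
```

With these assumed parameters the higher concentration raises the computed Tm by well over ten degrees, so a concentration mismatch of this size is enough on its own to produce discrepancies of the kind described in the question.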

SPSS Statistics GPL Displaying percentage bars for multi-coded variables

I am trying to create a frequency chart to show the percentages from a multi-response set BDecideCX1 to BDecideCX9, but the percentages are based on the total number of codes rather than the total number of cases. I've tried using a BASE command to repercentage on a different base on the ELEMENT below, but with no success. Any help much appreciated.
TEMP.
SELECT IF BDecideCX1>=0.
MRSETS
/MDGROUP NAME=$temp
VARIABLES = BDecideCX1 to BDecideCX9
VALUE=1
LABEL="Who was involved in deciding how to spend the PE and sport premium?".
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=$temp RESPONSES() [NAME="RESPONSES"]
MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE TEMPLATE=["TCharts\FreqPurple.SGT"].
BEGIN GPL
SOURCE: s = userSource(id("graphdataset"))
DATA: temp=col(source(s), name("$temp"), unit.category())
DATA: responses=col(source(s), name("RESPONSES"))
SCALE: linear(dim(2), include(0))
GUIDE: text.title(label("Who was involved in deciding how to spend the PE and sport premium? FREQUENCY"))
GUIDE: axis(dim(2), label("%"))
GUIDE: axis(dim(1), label("Who was involved in deciding how to spend the PE and sport premium?"))
ELEMENT: interval(position(summary.percent(temp*responses)))
END GPL.

Lineplot of proportions over year in SPSS

Assume I have the following data
DATA LIST FREE / sex (A) year.
BEGIN DATA
m 2011
m 2011
m 2012
f 2011
f 2011
f 2011
f 2011
f 2012
f 2012
END DATA.
How can I plot a line showing how the proportions of males and females change over the years?
I don't want the absolute values or the total percentages, but the percentages per year.
I also need a crosstab that shows the percentages per year.
Syntax would be nice, thank you.
The crosstabs syntax would simply be CROSSTABS TABLE Year By Sex /CELLS = Col. You can actually build the graph you want through the GUI; to use the summary functions per year, though, you need to specify the year variable as either ordinal or nominal.
Here is the GGRAPH code the GUI printed out for me. Clean up as needed.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=year[LEVEL=ORDINAL] COUNT()[name="COUNT"] sex
MISSING=LISTWISE REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: year=col(source(s), name("year"), unit.category())
DATA: COUNT=col(source(s), name("COUNT"))
DATA: sex=col(source(s), name("sex"), unit.category())
GUIDE: axis(dim(1), label("year"))
GUIDE: axis(dim(2), label("Percent"))
GUIDE: legend(aesthetic(aesthetic.color.interior), label("sex"))
SCALE: linear(dim(2), include(0))
ELEMENT: line(position(summary.percent(year*COUNT, base.coordinate(dim(1)))),
color.interior(sex), missing.wings())
END GPL.
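As a cross-check on what the chart should show, the per-year percentages for the sample data can be worked out by hand. The sketch below does this in Python rather than SPSS, purely to verify the arithmetic; the variable names are introduced here. Each cell count is divided by the total for its year, so the percentages within a year sum to 100.

```python
from collections import Counter

# The sample data from the question, as (sex, year) pairs.
rows = [("m", 2011), ("m", 2011), ("m", 2012),
        ("f", 2011), ("f", 2011), ("f", 2011), ("f", 2011),
        ("f", 2012), ("f", 2012)]

cell_counts = Counter((year, sex) for sex, year in rows)
year_totals = Counter(year for _, year in rows)

# Percentage of each sex within its own year.
percent = {(year, sex): 100.0 * count / year_totals[year]
           for (year, sex), count in cell_counts.items()}

for (year, sex), value in sorted(percent.items()):
    print(year, sex, round(value, 1))
```

If the plotted lines or the crosstab percentages do not add up to 100 within each year, the percentage base is probably being taken over the wrong dimension.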

Resources