ARFF file format Weka - machine-learning

I am doing my machine learning homework and I am using Weka, which I am very new to. I am trying to use M5P, but the classifier is grayed out. I understand that means the file I'm using is incorrect, whether it be the format or the parameters. Can someone help me fix my ARFF file? I'm pretty sure the problem is within the attribute section.
Here it is.
@relation world_happiness
@attribute M5P
@attribute continent {Americas, Africa, Asia, Europe, Australia, Antarctica}
@attribute country string
@attribute SWL-ranking numeric
@attribute SWL-index numeric
@attribute life-expectancy numeric
@attribute GDP-per-capita numeric
@attribute access-to-education-score numeric
@data
Europe,'Albania',157,153.33,73.8,4.9,75.8
Africa,'Algeria',134,173.33,71.1,7.2,66.9
Africa,'Angola',149,160,40.8,3.2,?
Americas, 'Antigua And Barbuda',16,246.67,73.9,11,?
Americas,'Argentina',56,226.67,74.5,13.1,93.7
Europe,'Armenia',172,123.33,71.5,4.5,?
Australia,'Australia',26,243.33,80.3,31.9,?
Europe,'Austria',3,260,79,32.7,99.1
Asia,'Azerbaijan',144,163.33,66.9,4.8,80.2
Americas,'Bahamas',5,256.67,69.7,20.2,?
Asia,'Bahrain',33,240,74.3,23,102
Asia,'Bangladesh',104,190,62.8,2.1,53.7
Americas,'Barbados',27,243.33,75,17,101.1
Europe,'Belarus',170,133.33,68.1,6.9,94.2
Europe,'Belgium',28,243.33,78.9,31.4,145.4
Americas,'Belize',48,230,71.9,6.8,71.6
Africa,'Benin',122,180,54,1.1,21.8
Asia,'Bhutan',8,253.33,62.9,1.4,?
Americas,'Bolivia',117,183.33,64.1,2.9,?
Europe, 'Bosnia & Herzegovina',137,170,74.2,6.8,?
Africa,'Botswana',123,180,36.3,10.5,81.8
Americas,'Brazil',81,210,70.5,8.4,103.2
Asia, 'Brunei Darussalam',9,253.33,76.4,23.6,?
Europe,'Bulgaria',164,143.33,72.2,9.6,92
Africa, 'Burkina Faso',152,156.67,47.5,1.3,10
Asia,'Burma',130,176.67,60.2,1.7,?
Africa,'Burundi',178,100,43.6,0.7,?
Asia,'Cambodia',110,186.67,56.2,2.2,17.3
Africa,'Cameroon',138,170,45.8,2.4,?
Americas,'Canada',10,253.33,80,34,102.6
Africa, 'Cape Verdi',100,193.33,70.4,6.2,?
Africa, 'Central African Republic',145,163.33,39.3,1.1,?
Africa,'Chad',159,150,43.6,1.5,11.5
Americas,'Chile',71,216.67,77.9,11.3,87.5
Asia,'China',82,210,71.6,6.8,62.8
Americas,'Colombia',34,240,72.4,7.9,70.9
Africa,'Comoros',97,196.67,63.2,0.6,?
Africa, 'Congo Democratic Republic',176,110,43.1,0.7,18.4
Africa, 'Congo Republic',105,190,52,1.3,?
Americas, 'Costa Rica',13,250,78.2,11.1,50.9
Europe,'Croatia',98,196.67,75,11.6,?
Americas,'Cuba',83,210,77.3,3.5,?
Europe,'Cyprus',49,230,78.6,7.14,?
Europe, 'Czech Republic',77,213.33,75.6,19.5,87.9
Europe,'Denmark',1,273.33,77.2,34.6,?
Africa,'Dijbouti',150,160,52.8,1.3,14.7
Americas,'Dominica',29,243.33,75.6,5.5,?
Americas, 'Dominican Republic',42,233.33,67.2,7,?
Americas,'Ecuador',111,186.67,74.3,4.3,56.7
Africa,'Egypt',151,160,69.8,3.9,?
Americas, 'El Salvador',61,220,70.9,4.7,49.8
Africa, 'Equatorial Guinea',135,173.33,43.3,50.2,?
Africa,'Eritrea',162,146.67,53.8,1,28.2
Europe,'Estonia',139,170,71.3,16.7,107
Africa,'Ethiopia',153,156.67,47.6,0.9,5.2
Australia, 'Fiji',57,223.33,67.8,6,?
Europe,'Finland',6,256.67,78.5,30.9,124.5
Europe,'France',62,220,79.5,29.9,108.7
Africa,'Gabon',88,206.67,54.5,6.8,54.4
Africa,'Gambia',106,190,55.7,1.9,27
Europe,'Georgia',169,136.67,70.5,3.3,77.7
Europe,'Germany',35,240,78.7,30.4,99
Africa,'Ghana',89,206.67,56.8,2.5,37.3
Europe,'Greece',84,210,78.3,22.2,94.6
Americas,'Grenada',72,216.67,65.3,5,?
Americas,'Guatemala',43,233.33,67.3,4.7,32.7
Africa,'Guinea',140,170,53.7,2,?
Africa,'Guinea-Bissau',124,180,44.7,0.8,20.4
Americas,'Guyana',36,240,63.1,4.6,81
Americas,'Haiti',118,183.33,51.6,1.7,?
Americas,'Honduras',37,240,67.8,2.9,?
Asia, 'Hong Kong',63,220,81.6,32.9,?
Europe,'Hungary',107,190,72.7,16.3,98.6
Europe,'Iceland',4,260,80.7,35.6,108.8
Asia,'India',125,180,63.3,3.3,49.9
Asia,'Indonesia',64,220,66.8,3.6,?
Asia,'Iran',96,200,70.4,8.3,80
Europe,'Ireland',11,253.33,77.7,41,123.1
Asia,'Israel',58,223.33,79.7,24.6,93
Europe,'Italy',50,230,80.1,29.2,92.8
Africa, 'Ivory Coast',160,150,45.9,1.6,21.7
Americas,'Jamaica',44,233.33,70.8,4.4,83.6
Asia,'Japan',90,206.67,82,31.5,102.1
Asia,'Jordan',141,170,71.3,4.7,87.7
Asia,'Kazakhstan',101,193.33,63.2,8.2,87
Africa,'Kenya',112,186.67,47.2,1.1,?
Asia,'Kuwait',38,240,76.9,19.2,55.6
Asia,'Kyrgyzstan',65,220,66.8,2.1,83
Asia,'Laos',126,180,54.7,1.9,35.6
Europe,'Latvia',154,156.67,71.6,13.2,88.9
Asia,'Lebanon',113,186.67,72,6.2,78.2
Africa,'Lesotho',165,143.33,36.3,2.5,28
Africa,'Libya',108,190,73.6,11.4,?
Europe,'Lithuania',155,156.67,72.3,13.7,93.4
Europe,'Luxembourg',12,253.33,78.5,55.6,95.3
Europe,'Macedonia',146,163.33,73.8,7.8,?
Africa,'Madagascar',103,193.33,55.4,0.9,?
Africa,'Malawi',158,153.33,39.7,0.6,?
Asia,'Malaysia',17,246.67,73.2,12.1,98.8
Asia,'Maldives',66,220,66.6,3.9,42.7
Africa,'Mali',131,176.67,47.9,1.2,15
Europe,'Malta',14,250,78.4,19.9,90.4
Africa,'Mauritania',132,176.67,52.7,2.2,?
Africa,'Mauritius',73,216.67,72.2,13.1,107.3
Americas,'Mexico',51,230,75.1,10,73.4
Europe,'Moldova',175,116.67,67.7,1.8,?
Asia,'Mongolia',59,223.33,64,1.9,64.4
Africa,'Morocco',114,186.67,69.7,4.2,39.3
Africa,'Mozambique',127,180,41.9,1.3,13.9
Africa,'Namibia',74,216.67,48.3,7,59.8
Asia,'Nepal',119,183.33,61.6,1.4,53.9
Europe,'Netherlands',15,250,78.4,30.5,124.1
Australia,' New Zealand',18,246.67,79.1,25.2,112.9
Americas,'Nicaragua',85,210,69.7,2.9,?
Africa,'Niger',161,150,44.4,0.9,?
Africa,'Nigeria',120,183.33,43.4,1.4,?
Europe,'Norway',19,246.67,79.4,42.3,117
Asia,'Oman',30,243.33,74.1,13.2,67.8
Asia,'Pakistan',166,143.33,63,2.4,39
Asia,'Palestine',128,180,72.5,5.8,80.7
Americas,'Panama',39,240,74.8,7.2,68.7
Australia, 'Papua New Guinea',86,210,55.3,2.6,21.2
Americas,'Paraguay',75,216.67,71,4.9,56.9
Americas,'Peru',115,186.67,70,5.9,80.8
Asia,'Philippines',78,213.33,70.4,5.1,75.9
Europe,'Poland',99,196.67,74.3,13.3,?
Europe,'Portugal',92,203.33,77.2,19.3,112
Asia,'Qatar',45,233.33,72.8,27.4,92.4
Europe,'Romania',136,173.33,71.3,8.2,80.2
Europe,'Russia',167,143.33,65.3,11.1,81.9
Africa,'Rwanda',163,146.67,43.9,1.5,12.1
Australia, 'Samoa Western',52,230,70.2,5.8,76
Africa, 'Sao Tome And Principe',60,223.33,63,1.2,?
Asia, 'Saudi Arabia',31,243.33,71.8,12.8,68.5
Africa,'Senegal',116,186.67,55.7,1.8,19.5
Africa,'Seychelles',20,246.67,72.7,7.8,?
Africa, 'Sierra Leone',143,166.67,40.8,0.8,23.9
Asia,'Singapore',53,230,78.7,28.1,?
Europe,'Slovakia',129,180,74,16.1,86.6
Europe,'Slovenia',67,220,76.4,21.6,98.8
Australia, 'Solomon Islands',54,230,62.3,1.7,?
Africa, 'South Africa',109,190,48.4,12,90.2
Asia, 'South Korea',102,193.33,77,20.4,97.4
Europe,'Spain',46,233.33,79.5,25.5,112.8
Asia, 'Sri Lanka',93,203.33,74,4.3,?
Americas, 'St Kitts And Nevis',21,246.67,70,8.8,?
Americas, 'St Lucia',47,233.33,72.4,5.4,94.3
Americas, 'St Vincent And The Grenadines',40,240,71.1,2.9,?
Africa,'Sudan',173,120,56.4,2.1,28.8
Americas,'Suriname',32,243.33,69.1,4.1,50.7
Africa,'Swaziland',168,140,32.5,5,?
Europe,'Sweden',7,256.67,80.2,29.8,152.8
Europe,'Switzerland',2,273.33,80.5,32.3,99.9
Asia,'Syria',142,170,73.3,3.9,42
Asia,'Taiwan',68,220,76.1,27.6,?
Asia,'Tajikistan',94,203.33,63.6,1.2,76
Africa,'Tanzania',121,183.33,46,0.7,5.31
Asia,'Thailand',76,216.67,70,8.3,79
Asia,'Timor-Leste',69,220,65.5,0.4,?
Africa,'Togo',147,163.33,54.3,1.7,?
Australia,' Tonga',70,220,72.2,2.3,?
Americas, 'Trinidad And Tobago',55,230,69.9,16.7,78.4
Africa,'Tunisia',79,213.33,73.3,8.3,74.6
Europe,'Turkey',133,176.67,68.7,8.2,?
Asia,'Turkmenistan',171,133.33,62.4,8,?
Asia,'Uae',22,246.67,78,43.4,74.4
Africa,'Uganda',156,156.67,47.3,1.8,?
Europe,'Ukraine',174,120,66.1,7.2,92.8
Europe, 'United Kingdom',41,236.67,78.4,30.3,157.2
Americas,'Uruguay',87,210,75.4,9.6,91.6
Americas,'Usa',23,246.67,77.4,41.8,94.6
Asia,'Uzbekistan',80,213.33,66.5,1.8,?
Australia,' Vanuatu',24,246.67,68.6,2.9,28.5
Americas,'Venezuela',25,246.67,72.9,6.1,?
Asia,'Vietnam',95,203.33,70.5,2.8,64.6
Asia,'Yemen',91,206.67,60.6,0.9,?
Africa,'Zambia',148,163.33,37.5,0.9,25.5
Africa,'Zimbabwe',177,110,36.9,2.3,45.3

You don't need the M5P line. That's not an attribute declaration. Just omit line 2.
The country attribute also has a problem: I get the message "Attribute is neither numeric or nominal". (I see you have it declared as string, so the syntax is right, but M5P can't handle string attributes.) When I remove the country attribute, I can run M5P (3 rules, correlation = 0.85).
Now, you may be thinking "but I want to keep track of which country my predictions are for". Here's how to do that:
First, set up the FilteredClassifier to remove attribute 2 (country) and run M5P.
Second, under More options, choose Output predictions and pick a format. Here I chose CSV (comma separated values), then right-clicked to select all attributes (first-last) to output.
Now click Start to build the model. This gives you the actual value, the predicted value, and all the data, including the country name.
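The GUI steps above have a command-line equivalent using the FilteredClassifier meta-classifier, which applies a filter before training. A sketch (assuming weka.jar is on the classpath; the ARFF filename is a placeholder):

```
java weka.classifiers.meta.FilteredClassifier \
    -F "weka.filters.unsupervised.attribute.Remove -R 2" \
    -W weka.classifiers.trees.M5P \
    -t world_happiness.arff
```

The Remove filter drops attribute 2 (country) on the fly, so the string attribute never reaches M5P, while it stays in the original file for later reference.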


BERT problem with context/semantic search in Italian language

I am using a BERT model for context search in Italian, but it does not capture the contextual meaning of the sentence and returns the wrong result.
In the example code below, when I compare "milk with chocolate flavour" against two other kinds of milk and one chocolate bar, it returns the highest similarity with the chocolate. It should return a higher similarity with the other milks.
Can anyone suggest an improvement to the code below so that it returns semantic results?
Code:
!python -m spacy download it_core_news_lg
!pip install sentence-transformers
import scipy.spatial
import numpy as np
from sentence_transformers import models, SentenceTransformer

model = SentenceTransformer('distiluse-base-multilingual-cased')  # works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean, Polish, Portuguese, Russian, Spanish, Turkish

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",  # Alpro, chocolate soy drink 1 ltr (soya milk)
    "Milka cioccolato al latte 100 g",  # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",  # Danone, HiPRO 25g protein chocolate flavor 330 ml (milk with chocolate flavor)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # milk with chocolate flavor
]
query_embeddings = model.encode(queries)

# Calculate cosine similarity of each query against each corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]
    results = sorted(zip(range(len(distances)), distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output :
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)
The problem is not with your code; it is just insufficient model performance.
There are a few things you can do. First, you can try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
Second, you can try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on RoBERTa and might give better performance.
Third, you can combine embeddings from several models (simply by concatenating the representations). In some cases this helps, at the expense of much heavier compute.
And finally, you can create your own model. It is well known that single-language models perform significantly better than multilingual ones. You can follow the guide and train your own Italian model.
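The concatenation idea can be sketched with plain NumPy. The arrays below are random stand-ins for the `.encode()` output of two different models (the dimensions are made up for illustration); each model's vectors are L2-normalized before concatenation so that neither model dominates cosine similarities in the combined space:

```python
import numpy as np

def concat_embeddings(emb_a, emb_b):
    """Concatenate L2-normalized embeddings from two models."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return np.hstack([a, b])

# Random stand-ins for two models' encode() output (e.g. 512-d and 768-d)
emb_model1 = np.random.rand(3, 512)
emb_model2 = np.random.rand(3, 768)
combined = concat_embeddings(emb_model1, emb_model2)
print(combined.shape)  # (3, 1280)
```

In practice you would pass `combined` to the same `cdist(..., "cosine")` ranking loop as before.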

Create a .bib file from bibitem (thebibliography)

I have about 200 \bibitem entries in the environment
\begin{thebibliography}{99}
\bibitem{Bermudez} Berm\'udez, J.D., J. V. Segura y E. Vercher (2010). \emph{Bayesian forecasting with the Holt-Winters model}. Journal of the Operational Research Society, 61, 164-171.
\end{thebibliography}
I want the resulting .bib file format
@article{bermudez2010bayesian,
title={Bayesian forecasting with the Holt--Winters model},
author={Berm{\'u}dez, Jos{\'e} D and Segura, Jos{\'e} Vicente and Vercher, Enriqueta},
journal={Journal of the Operational Research Society},
volume={61},
number={1},
pages={164--171},
year={2010},
publisher={Taylor \& Francis}
}
Is there a way I can do it without converting them one by one?
Regards
One possibility is to use https://text2bib.economics.utoronto.ca/ to convert the \bibitem into bibtex format. Choosing Spanish as language, the output of the conversion is
@article{Bermudez,
author = {Berm\'udez, J. D. and J. V. Segura and E. Vercher},
journal = {Journal of the Operational Research Society},
pages = {164-171},
title = {{B}ayesian forecasting with the Holt-Winters model},
volume = {61},
year = {2010},
}
Some fields are missing, e.g. the publisher, because this information was not contained in your \bibitem.
You can use tex2bib, a tool based on text2bib but migrated to a newer PHP version (PHP 7).
Here is an example of its use.
Input of text transformation:
\bibitem{Bermudez} Berm\'udez, J.D., J. V. Segura y E. Vercher (2010). \emph{Bayesian forecasting with the Holt-Winters model}. Journal of the Operational Research Society, 61, 164-171.
Output:
@article{bv10,
author = {Berm\'udez, J. D. and J. V. Segura y E. Vercher},
title = {Bayesian forecasting with the Holt-Winters model},
journal = {Journal of the Operational Research Society},
year = {2010},
volume = {61},
pages = {164-171},
}

WEKA Changing number of decimal places in predictions

I'm trying to get precise predictions from WEKA, and I need to increase the number of decimal places that it outputs for its prediction data.
My .arff training set looks like this:
@relation TrainSet
@attribute TimeDiff1 numeric
@attribute TimeDiff2 numeric
@attribute TimeDiff3 numeric
@attribute TimeDiff4 numeric
@attribute TimeDiff5 numeric
@attribute TimeDiff6 numeric
@attribute TimeDiff7 numeric
@attribute TimeDiff8 numeric
@attribute TimeDiff9 numeric
@attribute TimeDiff10 numeric
@attribute LBN/Distance numeric
@attribute LBNDiff1 numeric
@attribute LBNDiff2 numeric
@attribute LBNDiff3 numeric
@attribute Size numeric
@attribute RW {R,W}
@attribute 'Response Time' numeric
@data
0,0,0,0,0,0,0,0,0,0,203468398592,0,0,0,32768,R,0.006475
0.004254,0,0,0,0,0,0,0,0,0,4564742206976,4361273808384,0,0,65536,R,0.011025
0.002128,0.006382,0,0,0,0,0,0,0,0,4585966117376,21223910400,4382497718784,0,4096,R,0.01389
0.001616,0.003744,0,0,0,0,0,0,0,0,4590576115200,4609997824,25833908224,4387107716608,4096,R,0.005276
0.002515,0.004131,0.010513,0,0,0,0,0,0,0,233456156672,-4357119958528,-4352509960704,-4331286050304,32768,R,0.01009
0.004332,0.006847,0.010591,0,0,0,0,0,0,0,312887472128,79431315456,-4277688643072,-4273078645248,4096,R,0.005081
0.000342,0.004674,0.008805,0,0,0,0,0,0,0,3773914294272,3461026822144,3540458137600,-816661820928,8704,R,0.004252
0.000021,0.000363,0.00721,0,0,0,0,0,0,0,3772221901312,-1692392960,3459334429184,3538765744640,4096,W,0.00017
0.000042,0.000063,0.004737,0.01525,0,0,0,0,0,0,3832104423424,59882522112,58190129152,3519216951296,16384,W,0.000167
0.005648,0.00569,0.006053,0.016644,0,0,0,0,0,0,312887476224,-3519216947200,-3459334425088,-3461026818048,19456,R,0.009504
I'm trying to get predictions for the Response Time, which is the right-most column. As you can see, my data goes to the 6th decimal place.
However, WEKA's predictions only go to the 3rd. Here are the results of the file named "predictions":
inst# actual predicted error
1 0.006 0.005 -0.002
2 0.011 0.017 0.006
3 0.014 0.002 -0.012
4 0.005 0.022 0.016
5 0.01 0.012 0.002
6 0.005 0.012 0.007
7 0.004 0.018 0.014
8 0 0.001 0
9 0 0.001 0
10 0.01 0.012 0.003
As you can see, this greatly limits the accuracy of my predictions. For very small numbers less than 0.0005 (like row 8 and 9), they will show up as 0 instead of a more accurate smaller decimal number.
I'm using WEKA on the "Simple Command Line" instead of the GUI. My command to build the model looks like this:
java weka.classifiers.trees.REPTree -M 2 -V 0.00001 -N 3 -S 1 -L -1 -I 0.0 -num-decimal-places 6 \
-t [removed path]/TrainSet.arff \
-T [removed path]/TestSet.arff \
-d [removed path]/model1.model > \
[removed path]/model1output
([removed path]: I just removed the full pathname for privacy)
As you can see, I found this "-num-decimal-places" switch for creating the model.
Then I use the following command to make the predictions:
java weka.classifiers.trees.REPTree \
-T [removed path]/LUN0train.arff \
-l [removed path]/model1.model -p 0 > \
[removed path]/predictions
I can't use the "-num-decimal-places" switch here because WEKA doesn't allow it in this case for some reason. "predictions" is my desired predictions file.
So I run these two commands, and the number of decimal places in the predictions doesn't change! It's still only 3.
I've already looked at these answers, Weka decimal precision, and this answer on the Pentaho forum, but neither gave enough information to answer my question. Those answers hinted that changing the number of decimal places might not be possible, but I just want to be sure.
Does anyone know of an option to fix this? Ideally the solution would be on the command line, but if you only know how to do it in the GUI, that's OK.
I just figured out a workaround, which is to simply scale/multiply the data by 1000, get your predictions, and then multiply them by 1/1000 when done to recover the original scale. Kinda outside the box, but it works.
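That workaround can be sketched in a few lines of Python (the helper names are hypothetical; in practice you would rewrite the Response Time column of the training .arff before building the model and rescale Weka's predicted column afterwards):

```python
SCALE = 1000  # shift three decimal places into the integer part

def scale_targets(rows, target_col):
    """Multiply the target column by SCALE before writing the training .arff."""
    return [r[:target_col] + [r[target_col] * SCALE] + r[target_col + 1:]
            for r in rows]

def unscale_predictions(preds):
    """Divide Weka's predicted values by SCALE to recover the original units."""
    return [p / SCALE for p in preds]

# Two sample rows in the shape (TimeDiff1, RW, Response Time)
rows = [[0.004254, 'R', 0.011025], [0.002128, 'R', 0.01389]]
scaled = scale_targets(rows, 2)        # targets scaled into the integer range
preds = unscale_predictions([11.025])  # predicted values back to original units
```

Weka then rounds the scaled targets to 3 decimals, which preserves the original 6 decimal places of precision.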
EDIT: An alternative way to do it: Answer from Peter Reutemann from http://weka.8497.n7.nabble.com/Changing-decimal-point-precision-td43393.html:
This has been around for a long time. ;-) "-p" is the really old-fashioned way of outputting the predictions. Using the "-classifications" option, you can specify what format the output is to be in (e.g. CSV). The class that you specify with that option has to be derived from "weka.classifiers.evaluation.output.prediction.AbstractOutput":
http://weka.sourceforge.net/doc.dev/weka/classifiers/evaluation/output/prediction/AbstractOutput.html
Here is an example of using 12 decimals for the prediction output using Java:
https://svn.cms.waikato.ac.nz/svn/weka/trunk/wekaexamples/src/main/java/wekaexamples/classifiers/PredictionDecimals.java
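Applied to the prediction command above, that would look roughly like this (a sketch; the CSV output class and its -decimals option come from the AbstractOutput hierarchy linked above):

```
java weka.classifiers.trees.REPTree \
    -l [removed path]/model1.model \
    -T [removed path]/LUN0train.arff \
    -classifications "weka.classifiers.evaluation.output.prediction.CSV -decimals 6" > \
    [removed path]/predictions
```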

Is there any way to get abstracts for a given list of pubmed ids?

I have a list of PMIDs and I want to get the abstracts for both of them in a single URL hit:
pmids=[17284678,9997]
abstract_dict={}
url = https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=pubmed&id=17284678,9997&retmode=text&rettype=xml
My requirement is to get the result in this format:
abstract_dict={"pmid1":"abstract1","pmid2":"abstract2"}
I can get that format by fetching each ID separately and updating the dictionary, but to save time I want to pass all the IDs to the URL at once and extract only the abstract part.
Using BioPython, you can give the joined list of Pubmed IDs to Entrez.efetch and that will perform a single URL lookup:
from Bio import Entrez

Entrez.email = 'your_email@provider.com'
pmids = [17284678, 9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
This gives as result:
{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
Edit:
In the case of pmids without corresponding abstracts, watch out with the fix you suggested:
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]
Suppose you have the list of Pubmed IDs pmids = [1, 2, 3], but pmid 2 doesn't have an abstract, so abstracts = ['abstract of 1', 'abstract of 3']
This will cause problems in the final step where I zip both lists together to make a dict:
>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1',
2: 'abstract of 3'}
Note that abstracts are now out of sync with their corresponding Pubmed IDs, because you didn't filter out the pmids without abstracts and zip truncates to the shortest list.
Instead, do:
abstract_dict = {}
without_abstract = []
for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        abstract = article['Abstract']['AbstractText'][0]
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)

print(abstract_dict)
print(without_abstract)
from Bio import Entrez
import time

Entrez.email = 'your_email@provider.com'
pmids = [29090559, 29058482, 28991880, 28984387, 28862677, 28804631, 28801717,
         28770950, 28768831, 28707064, 28701466, 28685492, 28623948, 28551248]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()
             else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
             for pubmed_article in records['PubmedArticle']]
abstract_dict = dict(zip(pmids, abstracts))
print(abstract_dict)

Parsing author name, title, journal from unstructured references

I have a list of references. I'm trying to parse the author names, title, journal name, volume number, etc. The reference entries are not uniform: some contain only a title and multiple author names, some contain only a title, and so on. How do I go about parsing this and storing the information in the relevant columns? A few examples of the reference entries are shown below.
Neufeld et al., "Vascular endothelial growth factor (VEGF) and its receptors," The FASEB Journal, vol. 13, pp. 9-22 (1999).
PCT "International Search Report and Written Opinion" for International Application No. PCT/US08/60740, mailed Aug. 19, 2008; 7 pages.
Wirth, et al. Interactions between DNA and mono-, bis-, tris-, tetrakis-, and hexakis(aminoacridines). A linear and circular dichroism, electric orientation relaxation, viscometry, and equilibrium study. J. Am. Chem. Soc. 1988; 110 (3):932-939.
Buadu LD, Murakami, J, Murayama S., et al., "Breast Lesions: Correlation of Contrast Medium Enhancement Patterns on MR Images with Histophathological Findings and Tumor Angiogenesis," Radiology 1996, 200:639-649.
Bers "Excitation Contraction Coupling and Cardiac Contractile Force", Internal Medicine, 237(2): 17, 1991, Abstract.
Abella, J., Vera, X., Gonzalez, A., "Penelope: The NBTI-Aware Processor", MICRO 2007, pp. 85-96.
JP Office Action dtd Dec. 2, 2010, JP Appln. 2008-273888, partial English translation.
Maruyama, H., et al., "Id-1 and Id-2 are Overexpressed in Pancreatic Cancer and in Dysplastic Lesions in Chronic Pancreatitis," American Journal of Pathology 155(3):815-822 (1999).
Attachment 2, High Speed Data RLP Lucent Technologies, Version 0.1, Jan. 16, 1997.
Diddens, Heyke et al. "Patterns of Cross-Resistance to the Antigolate Drugs Trimetrexate, Metoprine, Homofolate, and CB3717 in Human Lymphoma and Osteosarcoma Cells Resistant to Methotrexate." Cancer Research, Nov. 1983, pp. 5286-5292, vol. 43.
Installation drawings having drawing No. 1069965, dated Aug. 14, 1999 (3 pages).
Means et al., Chemical modifications of proteins: history and applications, Bioconjugate Chem., 1:2-12 (1990).
Bock, "Natural History of Severe Reactions to Foods in Young Children," J. Pediatr. 107: 676-680, 1985.
Chankhunthod, Anawat, et al., "A Hierarachical Internet Object Cache," in Proceedings of the USENIX 1996 Annual Technical Conference; San Diego, CA., (Jan. 1996), pp. 153-163.
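As a starting point for entries like these, a few regular expressions can pull out the fields that do follow a pattern (year, volume, pages). This is only a rough sketch and will misfire on the less uniform entries; fully unstructured ones generally need a trained citation parser:

```python
import re

def rough_parse(ref):
    """Best-effort extraction of year, volume, and pages from a free-text reference."""
    year = re.search(r'\b(19|20)\d{2}\b', ref)
    volume = re.search(r'vol\.\s*(\d+)', ref, re.IGNORECASE)
    pages = re.search(r'pp\.\s*(\d+)\s*-\s*(\d+)', ref)
    return {
        'year': year.group(0) if year else None,
        'volume': volume.group(1) if volume else None,
        'pages': '-'.join(pages.groups()) if pages else None,
    }

ref = ('Neufeld et al., "Vascular endothelial growth factor (VEGF) and its '
       'receptors," The FASEB Journal, vol. 13, pp. 9-22 (1999).')
print(rough_parse(ref))  # {'year': '1999', 'volume': '13', 'pages': '9-22'}
```

Author and journal names are the hard part, since they have no reliable delimiters across these formats.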
