Parsing author name, title, journal from unstructured references


I have a list of references and am trying to parse the author names, title, journal name, volume number, and so on. The entries are not uniform: some contain a title and several author names, some contain only a title, and so on. How do I go about parsing these and storing the information in the relevant columns? A few example entries are shown below.
Neufeld et al., "Vascular endothelial growth factor (VEGF) and its receptors," The FASEB Journal, vol. 13, pp. 9-22 (1999).
PCT "International Search Report and Written Opinion" for International Application No. PCT/US08/60740, mailed , Aug. 19, 2008; 7 pages.
Wirth, et al. Interactions between DNA and mono-, bis-, tris-, tetrakis-, and hexakis(aminoacridines). A linear and circular dichroism, electric orientation relaxation, viscometry, and equilibrium study. J. Am. Chem. Soc. 1988; 110 (3):932-939.
Buadu LD, Murakami, J, Murayama S., et al., "Breast Lesions: Correlation of Contrast Medium Enhancement Patterns on MR Images with Histophathological Findings and Tumor Angiogenesis," Radiology 1996, 200:639-649.
Bers "Excitation Contraction Coupling and Cardiac Contractile Force", Internal Medicine, 237(2): 17, 1991, Abstract.
Abella, J., Vera, X., Gonzalez, A., "Penelope: The NBTI-Aware Processor", MICRO 2007, pp. 85-96.
JP Office Action dtd Dec. 2, 2010, JP Appln. 2008-273888, partial English translation.
Maruyama, H., et al., "Id-1 and Id-2 are Overexpressed in Pancreatic Cancer and in Dysplastic Lesions in Chronic Pancreatitis," American Journal of Pathology 155(3):815-822 (1999).
Attachment 2, High Speed Data RLP Lucent Technologies, Version 0.1, Jan. 16, 1997.
Diddens, Heyke et al. "Patterns of Cross-Resistance to the Antigolate Drugs Trimetrexate, Metoprine, Homofolate, and CB3717 in Human Lymphoma and Osteosarcoma Cells Resistant to Methotrexate." Cancer Research, Nov. 1983, pp. 5286-5292, vol. 43.
Installation drawings having drawing No. 1069965, dated Aug. 14, 1999 (3 pages).
Means et al., Chemical modifications of proteins: history and applications, Bioconjugate Chem., 1:2-12 (1990).
Bock, "Natural History of Severe Reactions to Foods in Young Children," J. Pediatr. 107: 676-680, 1985.
Chankhunthod, Anawat, et al., "A Hierarachical Internet Object Cache," in Proceedings of the USENIX 1996 Annual Technical Conference; San Diego, CA., (Jan. 1996), pp. 153-163.
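As a starting point for entries like these, a rule-based pass can pull out the fields that have a recognizable shape. The sketch below is only a heuristic baseline under stated assumptions: titles are taken to be the first quoted span, the authors are whatever precedes it, the year is the last four-digit number starting with 19 or 20, and pages are the first "number-number" range. The function name `parse_reference` and the patterns are illustrative choices, not an established method, and entries such as patent office actions will need further rules.

```python
import re

# Heuristic patterns; real-world references will need more rules than these.
QUOTED_TITLE = re.compile(r'["\u201c]([^"\u201d]+)["\u201d]')
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")
PAGE_RANGE = re.compile(r"(\d+)\s*[-\u2013]\s*(\d+)")


def parse_reference(ref):
    """Return a dict of fields guessed from one free-text reference."""
    fields = {"authors": None, "title": None, "year": None, "pages": None}

    title_match = QUOTED_TITLE.search(ref)
    if title_match:
        # A quoted span is taken as the title; the text before it as authors.
        fields["title"] = title_match.group(1).strip(" ,.")
        authors = ref[:title_match.start()].strip(" ,")
        fields["authors"] = authors or None

    years = YEAR.findall(ref)
    if years:
        # Assumption: the publication year is the last year-like number.
        fields["year"] = years[-1]

    pages = PAGE_RANGE.search(ref)
    if pages:
        fields["pages"] = pages.group(1) + "-" + pages.group(2)

    return fields


example = ('Bock, "Natural History of Severe Reactions to Foods in Young '
           'Children," J. Pediatr. 107: 676-680, 1985.')
print(parse_reference(example))
```

Each returned dict maps directly onto one row of a table, so a list of them can be written out with `csv.DictWriter`. Fields that the rules cannot find stay `None`, which makes it easy to see which entries need manual attention.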

Related

BERT problem with context/semantic search in italian language

I am using a BERT model for semantic search in Italian, but it does not capture the contextual meaning of the sentence and returns the wrong result.
In the example code below, I compare "milk with chocolate flavour" against two milk-type drinks and one chocolate bar. The query gets the highest similarity with the chocolate bar, but it should be most similar to the milk drinks.
Can anyone suggest an improvement to the code below so that it returns semantically correct results?
Code :
!python -m spacy download it_core_news_lg
!pip install sentence-transformers
from scipy.spatial.distance import cdist
from sentence_transformers import SentenceTransformer

# Works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean,
# Polish, Portuguese, Russian, Spanish, Turkish
model = SentenceTransformer('distiluse-base-multilingual-cased')
corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",  # Alpro, chocolate soy drink 1 ltr (soya milk)
    "Milka cioccolato al latte 100 g",  # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",  # Danone, HiPRO 25g protein chocolate flavor 330 ml (milk with chocolate flavor)
]
corpus_embeddings = model.encode(corpus)
queries = [
    'latte al cioccolato',  # milk with chocolate flavor
]
query_embeddings = model.encode(queries)
# Calculate the cosine similarity of each query against every corpus sentence
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = cdist([query_embedding], corpus_embeddings, "cosine")[0]
    # Sort corpus indices by ascending cosine distance (most similar first)
    results = sorted(enumerate(distances), key=lambda x: x[1])
    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")
    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output :
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)

The problem is not with your code; the model is simply not good enough for this task.
There are a few things you can do. First, you can try the Universal Sentence Encoder (USE). In my experience its embeddings are a little better, at least in English.
Second, you can try a different model, for example sentence-transformers/xlm-r-distilroberta-base-paraphrase-v1. It is based on RoBERTa and might perform better.
Third, you can combine embeddings from several models by concatenating the representations. In some cases this helps, at the expense of much heavier compute.
Finally, you can train your own model. Single-language models often perform better than multilingual ones, so you can follow the guide and train your own Italian model.
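The idea of concatenating embeddings from several models can be sketched as follows. This is a minimal illustration that uses plain Python lists and toy vectors so that it runs without any model download; in practice the vectors would come from each model's `encode` call. The helper names (`l2_normalize`, `concat_embeddings`, `cosine_similarity`) are hypothetical, and normalizing each model's vector before concatenation is a design choice assumed here, not something the answer prescribes.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def concat_embeddings(per_model_vectors):
    """Concatenate one sentence's embeddings from several models.

    Each vector is normalized first so that no single model dominates
    the similarity just because its embeddings have a larger scale.
    """
    combined = []
    for vec in per_model_vectors:
        combined.extend(l2_normalize(vec))
    return combined


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for the outputs of two hypothetical models.
query = concat_embeddings([[1.0, 0.0], [0.0, 2.0, 0.0]])
doc = concat_embeddings([[1.0, 0.0], [0.0, 0.0, 3.0]])
print(round(cosine_similarity(query, doc), 4))  # prints 0.5
```

With unit-normalized parts, the cosine similarity of the concatenated vectors equals the average of the per-model cosine similarities (here 1.0 from the first toy model and 0.0 from the second), so the combination behaves like a simple ensemble vote.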

Accounting for time with repeated-measures in lmer when not interested in time

I am trying to conduct a repeated-measures mixed-effects test with lmer and lmerTest, but I am not sure if I am doing it appropriately.
I have 6 sites with 3 plots per site that have been sampled once per year for 24 consecutive years. I have several environmental and species variables, but for simplicity, let's say I have two environmental variables (depth and temperature) and two species (species 1 and species 2). I am not interested in the time variable, changes with time, or the interactions, as this system has strong wet/dry seasonality where the effects of the dry season outweigh carry over effects of species from the prior year. I do not necessarily have data for all variables and plots every year, with some plots not sampled at times.
The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables.
Is it appropriate to include year as its own random effect in the model, along with plot within site?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
For this particular analysis, there were 435 total observations (plot/year), but I worry that it is not appropriately conducting repeated-measures.
anova(model1)
Type III Analysis of Variance Table with Satterthwaite's method
            Sum Sq Mean Sq NumDF  DenDF F value    Pr(>F)
depth       0.0221  0.0221     1 145.75  0.0908    0.7635
temperature 9.0213  9.0213     1 422.19 37.0429 2.596e-09 ***
species2    0.0597  0.0597     1 418.95  0.2450    0.6208
This does not seem right. Is there a better way to incorporate year, or should I include year at all?
If I exclude year, why does the DenDF for depth change so drastically?
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
Type III Analysis of Variance Table with Satterthwaite's method
            Sum Sq Mean Sq NumDF  DenDF  F value    Pr(>F)
depth        2.599   2.599     1 431.77   7.1096  0.007955 **
temperature 58.788  58.788     1 432.10 160.7955 < 2.2e-16 ***
species2     0.853   0.853     1 429.62   2.3336  0.127343
summary(M1)
Linear mixed model fit by maximum likelihood . t-tests use Satterthwaite's method ['lmerModLmerTest']
Formula: species1 ~ depth + temperature + species2 + (1 | site/plot)
Data: data
     AIC      BIC   logLik deviance df.resid
   833.4    861.9   -409.7    819.4      428
Scaled residuals:
     Min       1Q   Median       3Q      Max
-2.20675 -0.66119 -0.07051  0.52722  2.99942
Random effects:
 Groups    Name        Variance  Std.Dev.
 plot:site (Intercept) 0.0003221 0.01795
 site      (Intercept) 0.2051143 0.45290
 Residual              0.3656072 0.60465
Number of obs: 435, groups: plot:site, 24; site, 6
Fixed effects:
             Estimate Std. Error         df t value Pr(>|t|)
(Intercept) -0.538258   0.325072  50.071940  -1.656  0.10401
depth        0.006338   0.002377 431.768539   2.666  0.00796 **
temperature  0.391023   0.030837 432.101095  12.681  < 2e-16 ***
species2    -0.353264   0.231252 429.615226  -1.528  0.12734
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Correlation of Fixed Effects:
            (Intr) depth  temp
depth       -0.316
temperature -0.467 -0.204
specie2     -0.544  0.040  0.007

I may have asked more questions than I answered, but I hope some of this is helpful.
"The question is whether species2 (a predator) has any effect on populations of species1, relative to the environmental variables."
I think when you word it this way, it is not entirely clear. Are you interested in how the effect of species2 on species1 depends on the environmental variables (in other words, can the effect of species2 on species1 change with depth or temperature)? Or do you mean you would like to compare the effect of species2 on species1 with the effects of depth or temperature on species1? Or what do you mean, exactly, by "relative to the environmental variables"?
Yes, (1|year) + (1|site/plot) gives a random intercept for year and for plot within site. If you wanted the effect of a variable to vary across groups (i.e. a random slope), you would write something like (temperature|year) + (1|site/plot), for example if you thought the effect of temperature on species1 might differ between years.
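Written out, the random-intercept structure (1|year) + (1|site/plot) corresponds to the model below, where s indexes site, p indexes plot within site, and t indexes year. The symbols are notation chosen here for illustration and do not come from the question.

```latex
\begin{aligned}
y_{spt} &= \beta_0 + \beta_1\,\mathrm{depth}_{spt} + \beta_2\,\mathrm{temperature}_{spt}
          + \beta_3\,\mathrm{species2}_{spt} \\
        &\quad + a_s + b_{p(s)} + c_t + \varepsilon_{spt}, \\
a_s &\sim \mathcal{N}(0, \sigma^2_{\mathrm{site}}), \quad
b_{p(s)} \sim \mathcal{N}(0, \sigma^2_{\mathrm{plot}}), \quad
c_t \sim \mathcal{N}(0, \sigma^2_{\mathrm{year}}), \quad
\varepsilon_{spt} \sim \mathcal{N}(0, \sigma^2).
\end{aligned}
```

A random slope for temperature by year replaces \beta_2 with (\beta_2 + d_t), where the pair (c_t, d_t) is jointly normal with its own covariance matrix. Year therefore absorbs year-to-year shifts in the mean without any fixed-effect estimate of a time trend.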
Exactly how you specify the model depends on your knowledge of the biological system and of statistics. Based on the information in your question, the random-effects structure you have suggested appears completely reasonable to me. Yes, it accounts for the grouping in the data (by year and by plot within site). With only 435 observations, an overly complex model may run into convergence problems, so that is something to look out for.
I am not sure what you mean by "this does not seem right" - what are you expecting to see? What is missing?
I am seeing the same model twice (below) with different output values. Is there a copy-and-paste error here, or am I missing something? The same model structure should not give different values.
model1 <- lmer(species1 ~ depth + temperature + species2 + (1|year) + (1|site/plot), data=data)
You haven't removed year in the above line, but you have in the model behind the summary(M1) output below it.
My simple answer to the year question is yes, I would include year. Years differ so much in every biological dataset I have seen that it is worth including year as a random intercept at least, exactly as you have done. If the variance of that random intercept is estimated to be zero, the term has no effect, as if it were not in the model in the first place. At that point you can choose to fit year as a fixed effect instead if you would still like to account for the grouped nature of the data.
Also, there are lots of resources on this. Some examples:
Bolker, Benjamin M., Mollie E. Brooks, Connie J. Clark, Shane W. Geange, John R. Poulsen, M. Henry H. Stevens, and Jada-Simone S. White. "Generalized linear mixed models: a practical guide for ecology and evolution." Trends in ecology & evolution 24, no. 3 (2009): 127-135.
Harrison, Xavier A., Lynda Donaldson, Maria Eugenia Correa-Cano, Julian Evans, David N. Fisher, Cecily ED Goodwin, Beth S. Robinson, David J. Hodgson, and Richard Inger. "A brief introduction to mixed effects modelling and multi-model inference in ecology." PeerJ 6 (2018): e4794.
https://peerj.com/articles/4794/

Create a .bib file from bibitem (thebibliography)

I have about 200 \bibitem entries in a thebibliography environment:
\begin{thebibliography}
\bibitem{Bermudez} Berm\'udez, J.D., J. V. Segura y E. Vercher (2010). \emph{Bayesian forecasting with the Holt-Winters model}. Journal of the Operational Research Society, 61, 164-171.
\end{thebibliography}
I want the resulting .bib file to have the following format:
@article{bermudez2010bayesian,
  title={Bayesian forecasting with the Holt--Winters model},
  author={Berm{\'u}dez, Jos{\'e} D and Segura, Jos{\'e} Vicente and Vercher, Enriqueta},
  journal={Journal of the Operational Research Society},
  volume={61},
  number={1},
  pages={164--171},
  year={2010},
  publisher={Taylor \& Francis}
}
Is there a way to do this without converting the entries one by one?
Regards

One possibility is to use https://text2bib.economics.utoronto.ca/ to convert the \bibitem entries into BibTeX format. Choosing Spanish as the language, the output of the conversion is:
@article{Bermudez,
  author = {Berm\'udez, J. D. and J. V. Segura and E. Vercher},
  journal = {Journal of the Operational Research Society},
  pages = {164-171},
  title = {{B}ayesian forecasting with the Holt-Winters model},
  volume = {61},
  year = {2010},
}
Some fields are missing, e.g. the publisher, because this information was not contained in your \bibitem.

You can use tex2bib, a tool based on text2bib but migrated to a newer PHP version (PHP 7).
Example of use:
Input of text transformation:
\bibitem{Bermudez} Berm\'udez, J.D., J. V. Segura y E. Vercher (2010). \emph{Bayesian forecasting with the Holt-Winters model}. Journal of the Operational Research Society, 61, 164-171.
Output:
@article{bv10,
  author = {Berm\'udez, J. D. and J. V. Segura y E. Vercher},
  title = {Bayesian forecasting with the Holt-Winters model},
  journal = {Journal of the Operational Research Society},
  year = {2010},
  volume = {61},
  pages = {164-171},
}
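If all the entries follow the same layout as the example in the question, a short script can also do the bulk conversion locally. The sketch below assumes exactly that layout (authors, year in parentheses, title in \emph{...}, then journal, volume, and page range); the pattern and the function name `bibitem_to_bibtex` are illustrative, and entries in any other layout return `None` so they can be handled by hand or by one of the tools above. The author string is copied verbatim, so separators such as "y" still have to be changed to "and" for BibTeX.

```python
import re

# Matches: \bibitem{key} authors (year). \emph{title}. journal, volume, pages.
BIBITEM = re.compile(
    r"\\bibitem\{(?P<key>[^}]+)\}\s*"
    r"(?P<authors>.+?)\s*\((?P<year>\d{4})\)\.\s*"
    r"\\emph\{(?P<title>.+?)\}\.\s*"
    r"(?P<journal>[^,]+),\s*(?P<volume>\d+),\s*"
    r"(?P<first>\d+)\s*-+\s*(?P<last>\d+)\."
)


def bibitem_to_bibtex(item):
    """Convert one bibitem line in the layout above to a BibTeX entry."""
    m = BIBITEM.search(item)
    if m is None:
        return None  # layout not recognized; convert this entry another way
    lines = [
        "@article{%s," % m.group("key"),
        "  author = {%s}," % m.group("authors"),  # copied verbatim
        "  title = {%s}," % m.group("title"),
        "  journal = {%s}," % m.group("journal").strip(),
        "  year = {%s}," % m.group("year"),
        "  volume = {%s}," % m.group("volume"),
        "  pages = {%s--%s}," % (m.group("first"), m.group("last")),
        "}",
    ]
    return "\n".join(lines)


sample = (
    r"\bibitem{Bermudez} Berm\'udez, J.D., J. V. Segura y E. Vercher (2010). "
    r"\emph{Bayesian forecasting with the Holt-Winters model}. "
    r"Journal of the Operational Research Society, 61, 164-171."
)
print(bibitem_to_bibtex(sample))
```

Applying the function to every line that starts with \bibitem and joining the non-`None` results gives a first draft of the .bib file. Fields that are not in the source text, such as the issue number or publisher, cannot be recovered this way.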

Biopython - Big Discrepancy Calculating RNA melting Temperature over Literature

I see big discrepancies between the melting temperatures of RNA 7-mers calculated with Biopython and the values generated by a popular algorithm.
I tried the nearest-neighbour algorithm with the RNA and salt concentrations described in the respective paper (using the same thermodynamic table as the paper below: Freier et al. 1986). Yet the values differ considerably (run the code below to see).
I tried all seven salt correction methods provided by Biopython, but I never get close to the values generated by the siRNA design algorithm for the same 7-mers.
Can someone tell me how accurate Biopython's nearest-neighbour melting temperature algorithm is, especially for short oligomers like my 7-mers? Am I implementing something wrong? Any suggestions?
Values derived from executing sample input:
http://sidirect2.rnai.jp/
Tm is given for the seed duplex of the guide strand: bases 2-7
Literature:
"Thermodynamic stability and Watson–Crick
base pairing in the seed duplex are major
determinants of the efficiency of the
siRNA-based off-target effect"
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2602766/pdf/gkn902.pdf
from Bio.Seq import Seq
from Bio.SeqUtils import MeltingTemp

# (7-mer sequence, Tm reported by siDirect)
test_list = [
    ('GGAUUUG', 21.5),
    ('CUCAUUG', 18.1),
    ('CAUAUUC', 8.7),
    ('UUUGAGU', 19.2),
    ('UUUUGAG', 12.2),
    ('GUUUCAA', 14.9),
    ('AGUUUCG', 19.7),
    ('GAAGUUU', 13.3)
]
for t in test_list:
    myseq = Seq(t[0])
    # RNA_NN1 = Freier et al. (1986)
    tm = MeltingTemp.Tm_NN(myseq, dnac1=100, Na=100, nn_table=MeltingTemp.RNA_NN1, saltcorr=7)
    tm = round(tm, 1)  # round to one decimal
    print('BioPython Tm: ' + str(tm) + ' siDirect Tm: ' + str(t[1]))

I answered the question at biology.stackexchange and Biostars. In short: siDirect seems to calculate the Tm incorrectly because it uses a 1000-fold higher primer concentration.
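To see why a 1000-fold difference in strand concentration matters so much, the sketch below evaluates the standard two-state melting formula Tm = ΔH / (ΔS + R ln(C/4)) for a non-self-complementary duplex with equimolar strands. The ΔH and ΔS values are made-up round numbers for illustration only; they are not taken from Freier et al. or from siDirect, and `two_state_tm` is a hypothetical helper, not a Biopython function. In Biopython itself the strand concentrations are passed to `Tm_NN` as `dnac1` and `dnac2` in nM.

```python
import math

R = 1.987  # gas constant in cal/(K*mol)


def two_state_tm(delta_h_kcal, delta_s_cal, total_strand_molar):
    """Tm in deg C for a non-self-complementary duplex, equimolar strands."""
    k = total_strand_molar / 4.0
    tm_kelvin = (1000.0 * delta_h_kcal) / (delta_s_cal + R * math.log(k))
    return tm_kelvin - 273.15


# Made-up round numbers for illustration, NOT values from the literature.
DELTA_H = -50.0   # kcal/mol
DELTA_S = -140.0  # cal/(K*mol)

tm_low = two_state_tm(DELTA_H, DELTA_S, 100e-9)   # 100 nM total strands
tm_high = two_state_tm(DELTA_H, DELTA_S, 100e-6)  # 1000-fold higher
print("Tm at 100 nM: %.1f C" % tm_low)
print("Tm at 100 uM: %.1f C" % tm_high)
```

Because the concentration enters through a logarithm in the denominator, a 1000-fold change shifts the result by tens of degrees for a short duplex with these illustrative parameters, which is the same order of magnitude as the discrepancies described in the question.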

machine learning model to segregate input data

Consider these references as examples:
1) Cahn, R. S.; Ingold, C.; Prelog, V. Specification of Molecular Chirality. Angew. Chem. Int. Ed. 1966, 5, 385-415.
2) Christie, G. H.; Kenner, J. The Molecular Configurations of Polynuclear Aromatic Compounds. J. Chem. Soc., Trans. 1922, 121, 614-620.
3) Kuhn, R. Molekulare Asymmetrie in Stereochemie, 1933, 803.
4) Oki, M. Recent Advances in Atropisomerism. Topics in Stereochemistry 1983, 14, 1-81.
5) Miyashita, A.; Yasuda, A.; Takaya, H.; Toriumi, K.; Ito, T.; Souchi, T.; Noyori, R. Synthesis of 2,2'-bis(diphenylphosphino)-1,1'-binaphthyl (BINAP), an atropisomeric chiral bis(triaryl)phosphine, and its use in the rhodium(I)-catalyzed asymmetric hydrogenation of α-(acylamino)acrylic acids. J. Am. Chem. Soc. 1980, 102, 7932-7934.
I have a large number of references like those above, and I want to split each one (five in this case) into separate parts such as:
1) Name of the author(s)
2) Title
3) Date/year of publication
4) Pages
5) Any other information
I want to create a machine learning model that learns the formats by itself and extracts the right data from each reference into these parts. The algorithm should be as efficient as possible.
Question:
What algorithm and approach should I use to implement this?
Would I need multiple algorithms to build a model for this scenario?
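One common way to frame this task, offered here as an assumption rather than something stated in the question, is sequence labelling: each token of a reference gets a label such as author, title, journal, year, or pages, and a model such as a conditional random field is trained on hand-labelled examples. Citation parsers such as ParsCit, AnyStyle, and GROBID take this kind of approach. The sketch below shows only the feature-extraction step that would feed such a model; the function names and the feature set are illustrative choices.

```python
import re


def tokenize(reference):
    """Split a reference into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", reference)


def token_features(tokens, i):
    """Describe token i with simple, format-independent features."""
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_digit": tok.isdigit(),
        "looks_like_year": bool(re.fullmatch(r"(?:19|20)\d{2}", tok)),
        "is_punct": not tok[0].isalnum(),
        "is_initial": len(tok) == 1 and tok.isupper(),
        # Neighbouring tokens give the model local context.
        "prev": tokens[i - 1].lower() if i > 0 else "<START>",
        "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "<END>",
        # Rough position in the reference: authors tend to come first.
        "rel_pos": round(i / len(tokens), 1),
    }


ref = ("Oki, M. Recent Advances in Atropisomerism. "
       "Topics in Stereochemistry 1983, 14, 1-81.")
tokens = tokenize(ref)
features = [token_features(tokens, i) for i in range(len(tokens))]
print(features[tokens.index("1983")])
```

A single sequence model over features like these is usually enough, so separate algorithms per field are not required; the main cost is labelling a few hundred references for training.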
