Part of the data is not displayed fully in SPSS

I want to create a new string variable (Education) which will contain data from other string variables (Listofuniversities, Listofschools).
The problem is that the data in the variable Education is not displayed fully. It is displayed like this:
Education
TU
Gymna
TL
My original dataset looks like this (each row has a value in one column and a blank in the other):
Listofuniversities    Listofschools
TU
                      Gymnasium van der Ort
TEU
                      Gymnasium van der Ort
TU
                      Gymnasium van der Art
TL
                      Gymnasium van der Art
This is the syntax that I have written.
STRING Education (A8).
RECODE Listofuniversities ('TU'='TU') ('TEU'='TEU') ('TL'='TL') INTO Education.
EXECUTE.
RECODE Listofschools ("Gymnasium van der Ort" = "Gymnasium van der Ort") into Education.
VARIABLE WIDTH Education(20).
EXECUTE.

Your data looks as if there are two fields, "Listofuniversities" and "Listofschools", both string fields. It seems the two fields are independent of one another: when there is a non-blank value in one, the other is blank. Was this intended? If not, I'd look at how you read the data into the program.
Your first command creates a string field 8 characters wide (Education). The values you try to put into "Education" (from "Listofschools" at least) are clearly more than 8 characters wide, so it is appropriate to define "Education" as a wider field, e.g. STRING Education (A50).
If your intent is to consolidate these values across records:
STRING Education (A50).
DO IF (Listofschools = " ").
  COMPUTE Education = RTRIM(LTRIM(Listofuniversities)).
ELSE.
  COMPUTE Education = RTRIM(LTRIM(Listofschools)).
END IF.
EXECUTE.

Related

LaTeX - Save the value of integer variables to file, then load those integer values on future compile runs and reuse them

It's my first question here; I'm trying my best.
So, I'm trying to solve a problem in LaTeX:
Goal: I'm writing an exam/test that is used in similar forms multiple times a year. The number of tasks as well as the number of pages of the exam therefore changes. On the first page of the exam, the number of pages, the number of tasks, and the total number of points shall be displayed. They shall all be calculated automatically, so they don't need to be set manually each time the document is changed.
Primitive Problem: I'm aware that I can use counters for this (for the number of pages there is even a built-in counter already). The problem is that those counters only have their correct values at the end of the document. By that time the front page of the document has already been "written". So how do I get the values those counters have at the end of the document to the front of the document? (TeX has something like this, as this is how the ToC works.)
Current Solution: Trying to mimic what TeX does with its ToC, I save the values of my relevant counters to a file at the end of the document. At the start of the document I load the values from that file and use them throughout the document. Obviously I need to compile twice for the correct values to be displayed (on the first run the wrong values are loaded, but the right values are calculated and stored; on the second run the right values are loaded and used).
Here are the code snippets I'm using so far:
(This is not the full code; I cut out everything I thought was unnecessary and focused on what I thought was essential to this problem.)
% For handling the counters
\usepackage{datatool}
\DTLsetseparator{;}
\DTLloaddb[noheader, keys={thekey,thevalue}]{counters}{counters.dat}
\newcommand{\var}[1]{\DTLfetch{counters}{thekey}{#1}{thevalue}}
\begin{document}
% The following counters are ordinary counters.
\newcounter {aufgabenNummer}
\setcounter {aufgabenNummer} {0}
\newcounter {punkteSumme}
\setcounter {punkteSumme} {0}
\newcounter{punkteSummeTeilEins}
\newcounter{punkteSummeTeilZwei}
% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
% Start of the document
\thispagestyle{Deckblatt}
\begin{itemize}
\item Please check for completeness: the test consists of tasks 1 to \var{AnzahlAufgaben} and comprises \var{AnzahlSeiten} pages.
\end{itemize}
\newpage
\setlength{\parindent}{0pt}
\input{Grundlagen}
% The code snippet from below would be inserted here.
% ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
% Save the counter values to a file for later use on the cover page.
\makeatletter
\newwrite\outfile
\immediate\openout\outfile=counters.dat
\immediate\write\outfile{AnzahlSeiten;\thepage}
\immediate\write\outfile{AnzahlAufgaben;\theaufgabenNummer}
\immediate\write\outfile{AnzahlPunkte;\thepunkteSumme}
\immediate\write\outfile{AnzahlPunkteTeilEins;\thepunkteSummeTeilEins}
\immediate\write\outfile{AnzahlPunkteTeilZwei;\thepunkteSummeTeilZwei}
\immediate\closeout\outfile
\makeatother
\end{document}
Problem with current Solution: My main problem with this solution at the moment is that the values I load from "counters.dat" through DTL are string values. However, I would like to get them as integer values that I can then use for some basic arithmetic.
However if I try to insert the following line
\setcounter{punkteSummeTeilZwei}{\var{AnzahlPunkteTeilZwei}}
I get an error:
Missing number, treated as zero. ...SummeTeilZwei}{\var{AnzahlPunkteTeilZwei}}
If I instead put the following line there
\setcounter{punkteSummeTeilZwei}{\value{\var{AnzahlPunkteTeilZwei}}}
I get another error:
Missing \endcsname inserted. ...lZwei}{\value{\var{AnzahlPunkteTeilZwei}}}
Missing number, treated as zero. ...lZwei}{\value{\var{AnzahlPunkteTeilZwei}}}
Extra \endcsname. ...lZwei}{\value{\var{AnzahlPunkteTeilZwei}}}
So hopefully I could explain my problem.
Maybe someone can point me to another solution that does what I want in a more clever way, or someone can improve my code so that the errors I'm encountering right now are resolved.
Thanks for your time and help :D

How to add a name in a .bib file in LaTeX

I have a name like Catherine de Palo Drid. I want its reference to look like
Drid C de P
I added it like this:
author={Drid, Catherine, de, Palo}
I have changed the arrangement many times, but none of them works.
Can anyone help?
THANKS
From the BibTeX documentation (btxdoc.pdf, p. 15/16):
Each name consists of four parts: First, von, Last, and Jr;
BibTeX allows three possible forms for the name:
"First von Last"
"von Last, First"
"von Last, Jr, First"
You want to treat "Catherine de Palo" as First and "Drid" as Last, since abbreviating names is only done in First. In that case I would use
author = {Drid, Catherine {de} Palo}
where the braces around "de" tell BibTeX to not alter that token.

Can we find sentences around an entity tagged via NER?

We have a model ready which identifies a custom named entity. The problem is that if the whole document is given, the model does not work as expected; if only a few sentences are given, it gives amazing results.
I want to select two sentences before and after a tagged entity.
E.g., if a part of the document has the word Colombo (which is tagged as GPE), I need to select two sentences before the tag and two sentences after it. I tried a couple of approaches, but the complexity is too high.
Is there a built-in way in spaCy with which we can address this problem?
I am using Python and spaCy.
I have tried parsing the document by identifying the index of the tag, but that approach is really slow.
It might be worth seeing if you can improve the custom named entity recognizer, because it is unusual for extra context to hurt performance; if you fix that issue, it will potentially work better overall.
However, regarding your concrete question about surrounding sentences:
A Token or a Span (an entity is a Span) has a .sent attribute that gives you the covering sentence as a Span. If you look at the tokens right before/after a given sentence's start/end tokens, you can get the previous/next sentences for any token in a document.
import spacy

def get_previous_sentence(doc, token_index):
    if doc[token_index].sent.start - 1 < 0:
        return None
    return doc[doc[token_index].sent.start - 1].sent

def get_next_sentence(doc, token_index):
    if doc[token_index].sent.end + 1 >= len(doc):
        return None
    return doc[doc[token_index].sent.end + 1].sent

nlp = spacy.load('en_core_web_lg')
text = "Jane is a name. Here is a sentence. Here is another sentence. Jane was the mayor of Colombo in 2010. Here is another filler sentence. And here is yet another padding sentence without entities. Someone else is the mayor of Colombo right now."
doc = nlp(text)

for ent in doc.ents:
    print(ent, ent.label_, ent.sent)
    print("Prev:", get_previous_sentence(doc, ent.start))
    print("Next:", get_next_sentence(doc, ent.start))
    print("----")
Output:
Jane PERSON Jane is a name.
Prev: None
Next: Here is a sentence.
----
Jane PERSON Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
2010 DATE Jane was the mayor of Colombo in 2010.
Prev: Here is another sentence.
Next: Here is another filler sentence.
----
Colombo GPE Someone else is the mayor of Colombo right now.
Prev: And here is yet another padding sentence without entities.
Next: None
----
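Since the question asks for two sentences on each side rather than one, here is a rough sketch (not part of the original answer) that builds on the helpers above and walks outward sentence by sentence, using the same doc and ents as in the example:

# Sketch only: collect up to `n` sentences on each side of the entity's own
# sentence by repeatedly stepping over sentence boundaries.
def sentence_window(doc, ent, n=2):
    sentences = [ent.sent]
    # Walk backwards: the token just before a sentence's first token
    # belongs to the previous sentence.
    current = ent.sent
    for _ in range(n):
        if current.start - 1 < 0:
            break
        current = doc[current.start - 1].sent
        sentences.insert(0, current)
    # Walk forwards: a sentence's .end index points at the first token of
    # the next sentence, if there is one.
    current = ent.sent
    for _ in range(n):
        if current.end >= len(doc):
            break
        current = doc[current.end].sent
        sentences.append(current)
    return sentences

for ent in doc.ents:
    print(ent.text, "->", " | ".join(s.text for s in sentence_window(doc, ent)))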

Extracting text from APA citation

I have a spreadsheet containing APA-style citations, and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns the title partially; the output should show every character between "). " and the following ".".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works, but it does not stop at the first ". ", as in this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: get the substring starting right after the first occurrence of "). ", up to and including the period of the first ". " that follows.
If you wish to use REGEXEXTRACT, then this works (on your two examples, at least):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]): the greedy .* matches any number of characters, followed by one character that is not a period, so the match runs on to the last period-then-space in the cell and multiple sentences get captured. The expression finished with \.\s, which was outside the capture group, so the captured title would also end just before a period-then-space rather than including it.
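As a quick cross-check outside Sheets (not part of the original answer), the two patterns behave the same way in Python's re module on the Sanders citation from the question: the greedy capture runs to the last ". ", while the fixed pattern stops at the end of the title.

import re

sanders = ("Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for "
           "organizing the tools and techniques of participatory design. In Proceedings "
           "of the 11th biennial participatory design conference (pp. 195–198). ACM. "
           "Retrieved from http://dl.acm.org/citation.cfm?id=1900476")

# Original pattern: the greedy .* keeps matching past the first ". ",
# so several sentences end up in the capture group.
print(re.search(r"\)\.\s(.*[^\.])\.\s", sanders).group(1))

# Fixed pattern: [^.]* cannot cross a period, so the capture stops at the
# end of the title sentence.
print(re.search(r"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)", sanders).group(1))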
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, Sheets treats it as the negative number -2010.
For your title, try adding INDEX and SPLIT to your existing formula:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

Canonicalize NFL team names

This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.
I should also add: if anyone knows of a source of data containing both moneyline Vegas odds and actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:
http://www.footballlocks.com/nfl_odds.shtml
http://www.repole.com/sun4cast/freepick.shtml
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the nearest Levenshtein distance?
Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.
Scan[(fullname[First@#] = #[[2]])&, {
{"ari", "Arizona Cardinals"}, {"atl", "Atlanta Falcons"},
{"bal", "Baltimore Ravens"}, {"buf", "Buffalo Bills"},
{"car", "Carolina Panthers"}, {"chi", "Chicago Bears"},
{"cin", "Cincinnati Bengals"}, {"clv", "Cleveland Browns"},
{"dal", "Dallas Cowboys"}, {"den", "Denver Broncos"},
{"det", "Detroit Lions"}, {"gbp", "Green Bay Packers"},
{"hou", "Houston Texans"}, {"ind", "Indianapolis Colts"},
{"jac", "Jacksonville Jaguars"}, {"kan", "Kansas City Chiefs"},
{"mia", "Miami Dolphins"}, {"min", "Minnesota Vikings"},
{"nep", "New England Patriots"}, {"nos", "New Orleans Saints"},
{"nyg", "New York Giants NYG"}, {"nyj", "New York Jets NYJ"},
{"oak", "Oakland Raiders"}, {"phl", "Philadelphia Eagles"},
{"pit", "Pittsburgh Steelers"}, {"sdc", "San Diego Chargers"},
{"sff", "San Francisco 49ers forty-niners"}, {"sea", "Seattle Seahawks"},
{"stl", "St Louis Rams"}, {"tam", "Tampa Bay Buccaneers"},
{"ten", "Tennessee Titans"}, {"wsh", "Washington Redskins"}}]
Then, for any given string, find the longest common subsequence with each of the full names of the teams. To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name has the longest longest-common-subsequence (sic) with the input string is the team we return. Here's a Mathematica implementation of the algorithm:
teams = keys@fullnames;

(* argMax[f, domain] returns the element of domain for which f of that element is
   maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]

canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ",
                             " "<>fullname@#<>" ", IgnoreCase->True]&, teams]
Quick inspection by sight shows that both data sets contain the teams' locations (e.g. "Minnesota"). Only one of them has the teams' names. That is, one list looks like:
Denver
Minnesota
Arizona
Jacksonville
and the other looks like
Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars
Seems like, in this case, some pretty simple substring matching would do it.
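A minimal Python sketch of that substring matching, with hypothetical list names and assuming each bare location appears verbatim in exactly one full team name:

# Hypothetical stand-ins for the two data sources.
locations = ["Denver", "Minnesota", "Arizona", "Jacksonville"]
full_names = ["Denver Broncos", "Minnesota Vikings",
              "Arizona Cardinals", "Jacksonville Jaguars"]

# Map each bare location onto the full name that contains it.
mapping = {loc: next(name for name in full_names if loc in name)
           for loc in locations}

print(mapping["Minnesota"])   # Minnesota Vikings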
If you know both the source and destination names, then you just need to map them.
In PHP, you would just use an array with keys from the data source and values from the destination. Then you would reference them like:
$map = array('49ers'   => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');

foreach ($incoming_name as $name) {
    echo $map[$name];
}
