How do I parse Dublin Core data in iOS? The data is like this:
dc:contributor Binder, Henry.
dc:creator Crane, Stephen, 1871-1900.
dc:date c1982
dc:description This stirring tale of action in the American Civil War captures the immediacies and experiences of actual battle and army life.
dc:format ix, 173 p. ; 22 cm.
dc:identifier 0393013456
dc:identifier 9780393013450
dc:identifier 0380641135 (pbk.)
dc:identifier 9780380641130 (pbk.)
dc:language eng
dc:publisher Norton
dc:subject Chancellorsville, Battle of, Chancellorsville, Va., 1863--Fiction.
dc:subject Virginia--History--Civil War, 1861-1865--Fiction.
dc:title The red badge of courage : an episode of the American Civil War
dc:type War stories.
dc:type Historical fiction.
dc:type Text
oclcterms:recordCreationDate 811217
oclcterms:recordIdentifier 81022419
oclcterms:recordIdentifier 8114241
It looks like your data has a predictable format, i.e. each line has a clear tag followed by a tab-like separator.
Hence, if you are familiar with the tag names, you can definitely write a small program to consume each line accordingly.
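For example, here is a minimal sketch in Python (the same split-on-first-whitespace idea ports directly to Swift); repeated tags such as dc:identifier are collected into lists:
from collections import defaultdict

raw = """dc:creator Crane, Stephen, 1871-1900.
dc:identifier 0393013456
dc:identifier 9780393013450"""

record = defaultdict(list)
for line in raw.splitlines():
    parts = line.split(None, 1)  # split the tag from the value at the first whitespace run
    if len(parts) == 2:
        record[parts[0]].append(parts[1])

print(record['dc:identifier'])   # -> ['0393013456', '9780393013450']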
An important question came up when I tried to translate an existing iOS application into Lithuanian. I know how the Apple translation system works, especially for languages like English or Hungarian, but I don't know how to translate Lithuanian nouns in combination with numerals.
Lithuanian grammar, in combination with numerals, works like this for the word "įvykis" (event):
Lithuanian        English
0 įvykių          0 events
1 įvykis          1 event
2 - 9 įvykiai     2 - 9 events
10 - 20 įvykių    10 - 20 events
21 įvykis         21 events
22 - 29 įvykiai   22 - 29 events
30 įvykių         30 events
The same pattern as for 21 - 30 repeats for every higher decade.
More information about Lithuanian noun declension by numerals can be found in this Wikipedia article.
My question is: what key values have to be filled into the "Localizable.stringsdict" for Lithuanian? For English this file looks like this:
and for Lithuanian the same file looks like this:
The entries in the last table are only partly correct. Does anyone know which keys I have to use in order to map my table onto the stringsdict table? Which keys/keywords are necessary?
In the stringsdict file you can only have the keys zero, one, two, few, many, and other. That is all you actually need. iOS has its own data (based on information from the Unicode standard) that tells it which of those keys to use based on the actual number.
This is covered in the (now archived) Internationalization and Localization Guide, specifically the Handling Noun Plurals and Units Of Measure chapter with specifics about the stringsdict file in Appendix C.
You may also find the language-specific rules from Unicode. Scroll down to Lithuanian and you will see the built-in rules for which category is used with a given number.
In short, you want the following for your "events" in Lithuanian:
one - %d įvykis
few - %d įvykiai
other - %d įvykių
iOS will know to use one for 1, 21, 31, 41, etc. It will know to use few for 2~9, 22~29, etc. It will know to use other for 0, 10~20, 30, etc.
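For concreteness, a minimal sketch of the corresponding Lithuanian stringsdict entry (the top-level key "events_count" and the variable name "events" are placeholders of my own; substitute your actual keys):
<key>events_count</key>
<dict>
    <key>NSStringLocalizedFormatKey</key>
    <string>%#@events@</string>
    <key>events</key>
    <dict>
        <key>NSStringFormatSpecTypeKey</key>
        <string>NSStringPluralRuleType</string>
        <key>NSStringFormatValueTypeKey</key>
        <string>d</string>
        <key>one</key>
        <string>%d įvykis</string>
        <key>few</key>
        <string>%d įvykiai</string>
        <key>other</key>
        <string>%d įvykių</string>
    </dict>
</dict>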
I'm searching for a simple parser that translates a String with wiki markup code to readable plain text, e.g.
A lot of these sources can also be used to add to other parts of the article, like the plot section. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 05:34, 22 March 2012 (UTC)
to
A lot of these sources can also be used to add to other parts of the article, like the plot section. SilverserenC 05:34, 22 March 2012 (UTC)
I tried DKPro JWPL (which is also where the above example comes from), but this framework's plain-text output doesn't handle wiki talk pages (discussions) correctly. It simply deletes lines that start with a run of ":" characters, which are crucial for talk pages.
Okay, I found out that the old Wikipedia parser from JWPL works: "de.tudarmstadt.ukp.wikipedia.parser"
You can use it like:
// The classes below come from JWPL's de.tudarmstadt.ukp.wikipedia.parser packages.
MediaWikiParserFactory pf = new MediaWikiParserFactory(Language.english);
MediaWikiParser parser = pf.createParser();
ParsedPage pp = parser.parse("some wiki code with markups");
System.out.println(pp.getText());  // prints the plain text of the parsed page
I have a spreadsheet containing APA citation style text and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns the title partially - the output should show every character between "). " and the following ".".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works, but it does not stop at the first ". ", as in this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: get the substring starting after the first occurrence of "). ", up to and including the period of the next ". ".
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, the capture group (.*[^\.]) is greedy: .* matches as many characters as possible and only backtracks to the last period-then-space in the cell, so several sentences can end up in the capture. (Note that [^\.] simply means "any character except a period"; the backslash is redundant inside a character class.) Also, the final \.\s sits outside the capture group, so the captured title ends before a period-then-space rather than including it.
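To see the greediness concretely outside of Sheets, here is a quick check in Python, whose re module treats greedy and lazy quantifiers the same way for this pattern:
import re

s = ("Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for "
     "organizing the tools and techniques of participatory design. In Proceedings "
     "of the 11th biennial participatory design conference (pp. 195-198). ACM.")

# Greedy: .* runs to the end and backtracks only to the LAST period-then-space,
# so the capture spans two sentences.
print(re.search(r"\)\.\s(.*[^.])\.\s", s).group(1))

# Lazy: .*? stops at the FIRST period-then-space after the year, keeping the period.
print(re.search(r"\)\.\s(.*?\.)\s", s).group(1))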
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, it thinks it is a negative number -2010.
For your title, try wrapping your existing formula in INDEX and SPLIT:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."
Does WordNet have "higher order" concepts? How to generate them for a given word?
I have a corpus of data in the form of Prolog 'facts'. I want to generalize the conceptual components, e.g. 'contains'('oranges', 'vitamin c'). and 'contains'('spinach', 'iron'). would be generalized to 'contains'(<food>, <nutrient>).
I don't know WordNet very well, so one thing I was thinking about was just to generate all possible hypernyms and then combinatorially elaborate every possible new rule, but this is sort of a 'brute force' approach.
Does WordNet store higher order concepts such as <food> for instance? That might make it easier, because then I can just create one new rule with the higher order concept of that particular variable, assuming that there is one in WordNet, as opposed to perhaps fifty or one hundred if I do it the brute force way.
So what I actually want to know is: is there a command to generate the higher-order concepts for each of the three components within a given 'fact'? Or maybe just for the two that are inside the parentheses. If such a command exists, what is it?
Below is some of the data I'm working with for reference.
'be'('mr jiang', 'representing china').
'be'('hrh', 'britain').
'be more than'('# distinguished guests', 'the principal representatives').
'end with'('the playing of the british national anthem', 'hong kong').
'follow at'('the stroke of midnight', 'this').
'take part in'('the ceremony', 'both countries').
'start at about'('# pm', 'the ceremony').
'end about'('# am', 'the ceremony').
'lower'('the british hong kong flag', '# royal hong kong police officers').
'raise'('the sar flag', 'another #').
'leave for'('the royal yacht britannia', 'the #').
'hold by'('the chinese and british governments', 'the handover of hong kong').
'rise over'('this land', 'the regional flag of the hong kong special administrative region of the people \'s republic of china').
'cast eye on'('hong kong', 'the world').
'hold on'('schedule', 'the # governments').
'be festival for'('the chinese nation', 'this').
'go in'('the annals of history', 'july # , #').
'become master of'('this chinese land', 'the hong kong compatriots').
'enter era of'('development', 'hong kong').
'remember'('mr deng xiaoping', 'history').
'be along'('the course', 'it').
'resolve'('the hong kong question', 'we').
'wish to express thanks to'('all the personages', 'i').
'contribute to'('the settlement of the hong kong', 'both china and britain').
'support'('hong kong \'s return', 'the world').
WordNet refers to higher-order concepts as "hypernyms". A hypernym for the color "green", for instance, is "chromatic color", because green belongs to the higher-order class of chromatic colors.
One should note that WordNet differentiates between "words" (strings of characters) and "synsets" (the meanings we associate with a given string of characters). Just as one word can have multiple meanings, one string can have multiple synsets. If you want to retrieve all of the higher-order meanings for a given word, you can run these lines in Python:
from nltk.corpus import wordnet as wn

# Find every synset (word sense) of "green", then print each synset's hypernyms.
# (Older nltk releases expose the wordnet reader slightly differently; adjust
# the import to match your version.)
for synset in wn.synsets('green'):
    for hypernym in synset.hypernyms():
        print(synset, hypernym)
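If what you ultimately want is a single shared generalization like <food> for the two arguments of a fact, NLTK's lowest_common_hypernyms is a convenient shortcut. A minimal sketch, which naively assumes the first noun sense of each word is the right one (real data would need sense disambiguation):
from nltk.corpus import wordnet as wn

# Naive word-sense choice: take the first noun synset of each argument.
orange = wn.synsets('orange', pos=wn.NOUN)[0]
spinach = wn.synsets('spinach', pos=wn.NOUN)[0]

# The deepest synset(s) that are ancestors of both senses.
print(orange.lowest_common_hypernyms(spinach))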
This is actually a machine learning classification problem but I imagine there's a perfectly good quick-and-dirty way to do it. I want to map a string describing an NFL team, like "San Francisco" or "49ers" or "San Francisco 49ers" or "SF forty-niners", to a canonical name for the team. (There are 32 NFL teams so it really just means finding the nearest of 32 bins to put a given string in.)
The incoming strings are not actually totally arbitrary (they're from structured data sources like this: http://www.repole.com/sun4cast/stats/nfl2008lines.csv) so it's not really necessary to handle every crazy corner case like in the 49ers example above.
I should also add that in case anyone knows of a source of data containing both moneyline Vegas odds as well as actual game outcomes for the past few years of NFL games, that would obviate the need for this. The reason I need the canonicalization is to match up these two disparate data sets, one with odds and one with outcomes:
http://www.footballlocks.com/nfl_odds.shtml
http://www.repole.com/sun4cast/freepick.shtml
Ideas for better, more parsable, sources of data are very welcome!
Added: The substring matching idea might well suffice for this data; thanks! Could it be made a little more robust by picking the team name with the nearest Levenshtein distance?
Here's something plenty robust even for arbitrary user input, I think. First, map each team (I'm using a 3-letter code as the canonical name for each team) to a fully spelled out version with city and team name as well as any nicknames in parentheses between city and team name.
Scan[(fullname[First@#] = #[[2]])&, {
{"ari", "Arizona Cardinals"}, {"atl", "Atlanta Falcons"},
{"bal", "Baltimore Ravens"}, {"buf", "Buffalo Bills"},
{"car", "Carolina Panthers"}, {"chi", "Chicago Bears"},
{"cin", "Cincinnati Bengals"}, {"clv", "Cleveland Browns"},
{"dal", "Dallas Cowboys"}, {"den", "Denver Broncos"},
{"det", "Detroit Lions"}, {"gbp", "Green Bay Packers"},
{"hou", "Houston Texans"}, {"ind", "Indianapolis Colts"},
{"jac", "Jacksonville Jaguars"}, {"kan", "Kansas City Chiefs"},
{"mia", "Miami Dolphins"}, {"min", "Minnesota Vikings"},
{"nep", "New England Patriots"}, {"nos", "New Orleans Saints"},
{"nyg", "New York Giants NYG"}, {"nyj", "New York Jets NYJ"},
{"oak", "Oakland Raiders"}, {"phl", "Philadelphia Eagles"},
{"pit", "Pittsburgh Steelers"}, {"sdc", "San Diego Chargers"},
{"sff", "San Francisco 49ers forty-niners"}, {"sea", "Seattle Seahawks"},
{"stl", "St Louis Rams"}, {"tam", "Tampa Bay Buccaneers"},
{"ten", "Tennessee Titans"}, {"wsh", "Washington Redskins"}}]
Then, for any given string, find the longest common subsequence with each of the teams' full names. To give preference to strings matching at the beginning or the end (e.g., "car" should match "carolina panthers" rather than "arizona cardinals"), sandwich both the input string and the full names between spaces. Whichever team's full name has the longest longest-common-subsequence (sic) with the input string is the team we return. Here's a Mathematica implementation of the algorithm:
teams = keys@fullname;  (* keys is assumed to be a helper that lists the keys
   defined for the fullname hash above *)
(* argMax[f, domain] returns the element of domain for which f of that element is
maximal -- breaks ties in favor of first occurrence. *)
SetAttributes[argMax, HoldFirst];
argMax[f_, dom_List] := Fold[If[f[#1] >= f[#2], #1, #2] &, First@dom, Rest@dom]
canonicalize[s_] := argMax[StringLength@LongestCommonSubsequence[" "<>s<>" ",
" "<>fullname@#<>" ", IgnoreCase->True]&, teams]
Quick inspection by sight shows that both data sets contain the teams' locations (e.g. "Minnesota"). Only one of them has the teams' names. That is, one list looks like:
Denver
Minnesota
Arizona
Jacksonville
and the other looks like
Denver Broncos
Minnesota Vikings
Arizona Cardinals
Jacksonville Jaguars
Seems like, in this case, some pretty simple substring matching would do it.
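For instance, a minimal sketch in Python (the list is abbreviated; a real version would hold all 32 full names):
full_names = ["Denver Broncos", "Minnesota Vikings",
              "Arizona Cardinals", "Jacksonville Jaguars"]

def match(location):
    # Return the first full name that contains the bare location string.
    return next(n for n in full_names if location.lower() in n.lower())

print(match("Minnesota"))  # -> Minnesota Vikings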
If you know both the source and destination names, then you just need to map them.
In PHP, you would just use an array with keys from the data source and values from the destination, then reference them like:
$map = array('49ers'   => 'San Francisco 49ers',
             'packers' => 'Green Bay Packers');
$incoming_names = array('49ers', 'packers');  // names as they appear in the source data
foreach ($incoming_names as $name) {
    echo $map[$name] . "\n";
}