Determining Exact URL from a link inside wiki text

In a Wikipedia article's text, a link might be written like [Category:A B C]; however, the exact wiki URL will have a suffix like Category:A_B_C.
Where can I find information about all the rules the wiki uses to build the URL from a link in its text (e.g. converting spaces to underscores, capitalizing the first letter, dealing with non-ASCII characters, etc.)?

Roughly the following:
Normalize namespace, e.g. category: --> Category:.
Uppercase the first letter of the title proper, e.g. Category:foo --> Category:Foo. Note: this depends on wiki settings; titles are never uppercased on Wiktionary, for example.
Replace spaces with underscores, e.g. Foo bar --> Foo_bar.
Percent-encode all the usual characters with PHP's standard function urlencode(), except for the following ones: ;:@$!*(),/.
For full technical details you could look at MediaWiki's getLocalUrl() and wfUrlencode() functions.
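For illustration, here is a rough Python sketch of those steps (an approximation only, not MediaWiki's actual code; wiki_link_to_url_path is a made-up helper name and the safe-character list is the one given above):

from urllib.parse import quote

def wiki_link_to_url_path(link):
    # Normalize the namespace and uppercase the first letter of the title proper,
    # e.g. "category:A B C" -> "Category:A B C"
    if ":" in link:
        ns, title = link.split(":", 1)
        link = ns[:1].upper() + ns[1:].lower() + ":" + title[:1].upper() + title[1:]
    else:
        link = link[:1].upper() + link[1:]
    # Replace spaces with underscores
    link = link.replace(" ", "_")
    # Percent-encode, leaving alone the characters MediaWiki also leaves alone
    return quote(link, safe=";:@$!*(),/")

print(wiki_link_to_url_path("category:A B C"))  # -> Category:A_B_C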

There is no “etc.”; you already mentioned all the rules:
spaces are converted to underscores
the first letter of the article title is capitalized (the first letter of the namespace is capitalized too, if there is any)
the whole link is percent-encoded
Note that rules #1 and #2 are not mandatory: if you create your own URL that doesn't follow them, Wikipedia will still show the page correctly.
Things get more complicated if you include namespace aliases (WP:WikiProject Computing → Wikipedia:WikiProject_Computing) and interwiki links (wikia:gameofthrones:Westeros → http://www.wikia.com/wiki/c:gameofthrones:Westeros).

Related

Lua find and extract tags within a string

I feel questions similar to this have been asked previously, but not related to HTML-like tags or in Lua 5.4.
I have a string <NS>my_file_path.py</NS> <NS>count</NS> <NS>type: :model</NS> <TS>do some counting</TS>, and ideally I'd like to be able to pick specific tags (and everything between them), such as <NS>type: :model</NS>, and remove them from the string before doing any further formatting.
I'm guessing some matching with <NS>type: would be a start, but how to stop at </NS> is the confusing part!
First of all: Do not attempt to parse HTML (or XML) with RegEx (or Lua patterns). Use libraries instead.
However, if you're only interested in removing innermost tags (i.e. "leaf" tags: tags without children), your tags are strictly formatted in this simple fashion as in your example (no <tag spacing or attributes inside="tag" > allowed), and the scope of your project is very limited, you could use string.gsub and a pattern to remove these tags:
str = str:gsub("<NS>type:.-</NS>", "")
Pattern explanation:
find substrings starting with "<NS>type:"
allow for arbitrary content - zero or more arbitrary characters (.); note that this has to be lazy (-) instead of greedy (*) to work
stop matching the substring at the first occurrence of </NS>, closing the tag; if you used a greedy quantifier before, this would have stopped at the last occurrence of </NS>, exceeding the tag
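The same lazy-versus-greedy distinction exists in most regex dialects, so you can sanity-check the idea outside Lua too; for comparison, a rough Python equivalent (Python writes the lazy quantifier as .*?) would be:

import re

s = "<NS>my_file_path.py</NS> <NS>count</NS> <NS>type: :model</NS> <TS>do some counting</TS>"
# .*? is the lazy quantifier, playing the same role as Lua's .-
cleaned = re.sub(r"<NS>type:.*?</NS>", "", s)
print(cleaned)  # -> <NS>my_file_path.py</NS> <NS>count</NS>  <TS>do some counting</TS>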

Extracting text from APA citation

I have a spreadsheet containing APA citation style text and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the name of the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns the title partially - the output should show every character between "). " and the following ".".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works, but it does not stop at the first ". ", as with this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: get the substring starting after the first occurrence of "). ", up to and including the first occurrence of ". " that follows.
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]), which greedily matches any number of characters followed by a character that is not a dot, which means that multiple sentences can be captured. The expression finished with \.\s, which wasn't captured, so the capture group would end before a period-then-space rather than including it.
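Outside of Sheets you can verify the same logic with any regex engine; for example, a quick Python check of the corrected idea (skip past "(yyyy). ", then capture up to and including the first period) would be:

import re

citation = ("Parikka, J. (2010). Insect Media: An Archaeology of Animals "
            "and Technology. Minneapolis: Univ Of Minnesota Press.")
m = re.search(r"\(\d{4}\)\.\s([^.]*\.)", citation)
print(m.group(1))  # -> Insect Media: An Archaeology of Animals and Technology.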
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, it thinks it is a negative number -2010.
For your title, try adding INDEX and SPLIT to your existing formula:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

Titleize with roman numerals, dashes, apostrophes, etc. in Ruby on Rails

I'm simply trying to convert uppercased company names into proper names.
Company names can include:
Dashes
Apostrophes
Roman Numerals
Text like LLC, LP, and INC, which should stay uppercase.
I thought I might be able to use acronyms like this:
ACRONYMS = %W( LP III IV VI VII VIII IX GI)
ActiveSupport::Inflector.inflections(:en) do |inflect|
  ACRONYMS.each { |a| inflect.acronym(a) }
end
However, the conversion does not take into account word breaks, so having VI and VII does not work. For example, the conversion of "ADVISORS".titleize is "Ad VI Sors", as the VI becomes a whole word.
Dashes get removed.
It seems like there should be a generic gem for this generic problem, but I didn't find one. Is this problem really not that common? What's the best solution besides completely hacking the current inflection library?
Company names are a little odd, since a lot of times they're Marks (as in Service Mark) more than proper names. That means precise capitalization might actually matter, and trying to titleize might not be worth it.
In any case, here's a pattern that might work. Build your list of tokens to "keep", then manually split the string up and titleize the non-token parts.
# Make sure you put long strings before short (VII before VI)
word_tokens = %w{VII VI IX XI}
# Special characters need to be separate, since they never appear as "part" of another word
special_tokens = %w{-}
# Builds a regex like /(\bVII\b|\bVI\b|\bIX\b|\bXI\b|-)/ that wraps "word tokens" in a word boundary check
token_regex = /(#{word_tokens.map{|t| /\b#{t}\b/}.join("|")}|#{special_tokens.join("|")})/
title = "ADVISORS-XI"
title.split(token_regex).map{|s| s =~ token_regex ? s : s.titleize}.join
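A rough Python equivalent of the same split-and-titlecase approach (a sketch only; titleize_company is a made-up helper and str.title stands in for Rails' titleize):

import re

word_tokens = ["VII", "VI", "IX", "XI"]   # longest first, as above
special_tokens = ["-"]
token_regex = re.compile(
    "(" + "|".join([rf"\b{t}\b" for t in word_tokens]
                   + [re.escape(t) for t in special_tokens]) + ")"
)

def titleize_company(name):
    # re.split keeps the captured tokens; title-case everything that isn't a token
    parts = token_regex.split(name)
    return "".join(p if token_regex.fullmatch(p) else p.title() for p in parts)

print(titleize_company("ADVISORS-XI"))  # -> Advisors-XI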

Matching function in erlang based on string format

I have user information coming in from an outside source and I need to check if that user is active. Sometimes I have a User and a Server, and other times I have User@Server. The former case is no problem, I just have:
active(User, Server) ->
    do whatever.
What I would like to do with the User@Server case is something like:
active([User, "@", Server]) ->
    active(User, Server).
This doesn't seem to work. When calling active in the Erlang shell with "a@b", for example, I get an error that there is no match. Any help would be appreciated!
You can tokenize the string to get the result:
active(UserString) ->
    [User, Server] = string:tokens(UserString, "@"),
    active(User, Server).
If you need something more elaborate, or with better handling of something like email addresses, it might then be time to delve into using regular expressions with the re module.
active(UserString) ->
    RegEx = "^([\\w\\.-]+)@([\\w\\.-]+)$",
    {match, [User, Server]} = re:run(UserString, RegEx, [{capture, all_but_first, list}]),
    active(User, Server).
Note: The supplied regex is hardly sufficient for email address validation; it's just an example that allows all alphanumeric characters including underscores (\\w), dots (\\.), and dashes (-), separated by an at symbol. And it will fail if the match doesn't span the whole length of the string (^ to $).
A note on the pattern matching: for the real solution to your problem, I think @chops' suggestions should be used.
When matching patterns against strings, I think it's useful to keep in mind that Erlang strings are really lists of integers. So the string "@" is actually the same as [64] (64 being the ASCII code for @).
This means that your match pattern [User, "@", Server] will match lists like [97,[64],98], but not "a@b" (which in list form is [97,64,98]).
To match the string you need to do [User,$@,Server]. The $ operator gives you the ASCII value of the character.
However, this match pattern limits the matching string to be one character followed by @ and then one more character...
It can be improved by doing [User, $@ | Server], which allows the server part to have arbitrary length, but the User variable will still only match one single character (and I don't see a way around that).

Non-reserved yet safe characters for delimiters in a URL

I have seen the following on StackOverflow about URL characters:
There are two sets of characters you need to watch out for - Reserved and Unsafe.
The reserved characters are:
ampersand ("&")
dollar ("$")
plus sign ("+")
comma (",")
forward slash ("/")
colon (":")
semi-colon (";")
equals ("=")
question mark ("?")
'At' symbol ("@").
The characters generally considered unsafe are:
space,
question mark ("?")
less than and greater than ("<>")
open and close brackets ("[]")
open and close braces ("{}")
pipe ("|")
backslash ("\")
caret ("^")
tilde ("~")
percent ("%")
pound ("#").
I'm trying to code a URL so I can parse it using delimiters. They can't be numbers or letters though. Does anyone have a list of characters that are NOT Reserved but ARE safe to use?
Thanks for any help you can provide.
Don't bother trying to use safe/unreserved characters. Just use whatever delimiters you want and URL-encode the whole thing. Then URL-decode it on the other end and parse normally.
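A minimal Python sketch of that advice (the "|" delimiter and example.com URL are arbitrary choices):

from urllib.parse import quote, unquote

parts = ["first/part", "second&part", "third?part"]
delimited = "|".join(parts)
url = "https://example.com/page?data=" + quote(delimited, safe="")
# -> https://example.com/page?data=first%2Fpart%7Csecond%26part%7Cthird%3Fpart

received = unquote(url.split("data=", 1)[1])
print(received.split("|"))  # -> ['first/part', 'second&part', 'third?part']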
Is there a reason you can't just use the standard delimiter for URL parameters (&)? That is the most straightforward way to do it instead of trying to roll your own.
For example, the standard URL syntax already allows for multi-valued parameters natively. This is perfectly legal and doesn't require any trickery.
Somepage.aspx?parameterName=A&parameterName=B
The result is that the page would be passed "A,B" in the parameterName attribute.
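(How the repeated parameter is surfaced depends on the framework: ASP.NET joins the values with a comma as described above, while, for example, Python's urllib exposes them as a list.)

from urllib.parse import parse_qs

print(parse_qs("parameterName=A&parameterName=B"))
# -> {'parameterName': ['A', 'B']}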
