Simple Wikipedia Text into Plain Text Parser?

I'm searching for a simple parser that translates a String with wiki markup code to readable plain text, e.g.
A lot of these sources can also be used to add to other parts of the article, like the plot section. <font color="silver">[[User:Silver seren|Silver]]</font><font color="blue">[[User talk:Silver seren|seren]]</font><sup>[[Special:Contributions/Silver seren|C]]</sup> 05:34, 22 March 2012 (UTC)
to
A lot of these sources can also be used to add to other parts of the article, like the plot section. SilverserenC 05:34, 22 March 2012 (UTC)
I tried DKPro JWPL (which is also where the above example comes from), but this framework's plain-text output doesn't handle wiki talk pages (discussions) correctly. It simply deletes lines that start with a number of ":" characters, which are crucial for talk pages.

Okay, I found out that the old Wikipedia parser from JWPL works: "de.tudarmstadt.ukp.wikipedia.parser" (link).
You can use it like:
// create a parser for English Wikipedia markup
MediaWikiParserFactory pf = new MediaWikiParserFactory(Language.english);
MediaWikiParser parser = pf.createParser();
// parse the raw markup into a structured page and print its plain text
ParsedPage pp = parser.parse("some wiki code with markups");
System.out.println(pp.getText());
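If you only need to flatten simple markup such as the signature in the example above and want to avoid a parser dependency, a rough regex-based sketch can do it. This is just my own illustration, not part of JWPL, and it will break on nested or more exotic markup (which is exactly what the real parser handles for you):

// hypothetical helper: turns [[Page|label]] into label and drops HTML-style tags
static String stripSimpleMarkup(String wiki) {
    return wiki
            // [[Page|label]] -> label, [[Page]] -> Page
            .replaceAll("\\[\\[(?:[^|\\]]*\\|)?([^\\]]*)\\]\\]", "$1")
            // drop tags such as <font ...>, </font>, <sup>, </sup>
            .replaceAll("<[^>]+>", "")
            // collapse any double spaces left behind
            .replaceAll(" {2,}", " ")
            .trim();
}

Applied to the signature line from the question, this produces the desired "SilverserenC 05:34, 22 March 2012 (UTC)" form.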

Related

Mendeley-Overleaf Double Brace Issue

I am not sure if this is the appropriate place for this post; please let me know if I should post this somewhere else.
So I am using Overleaf and Mendeley to write my paper. I am using the online version of Mendeley and I sync my references into Overleaf mostly without any issues. But some of the generated references have an extra set of {}; an example is given below:
@incollection{Alberts2002TheProtein,
title = {{The shape and structure of protein}},
year = {2002},
booktitle = { Molecular Biology of the Cell},
author = {Alberts, Bruce and Johnson, Alexander and Lewis, Julian and Raff, Martin and Roberts, Keith and Walter, Peter},
edition = {4},
month = {3},
volume = {},
publisher = {Garland Science}}
As you can see, in the title field an extra set of braces was added by Mendeley. Whenever this happens I get a warning from Overleaf stating: BibTeX: "{Essential" isn't a brace-balanced string for entry Alberts2010EssentialBiology. This happens only for a few of my references.
Since I am syncing my references to Overleaf, I can't edit the .bib file in Overleaf and remove the extra set of braces. Is there any way to remove it directly on the Mendeley website? Or set a style or preference?
I know I can download my references, fix them in a local file, and upload that file to Overleaf, but I am trying to find a better alternative (a rough post-processing sketch for that fallback is included at the end of this post). Thank you for reading this.
Edit: Reproducible example. You can add the following details on Mendeley and then try to get the BibTeX citation code:
Type: Book Section
Book Title: Molecular Biology of the Cell
Title: The shape and structure of protein
Authors: (Last name, First name)
1. Alberts, Bruce
2. Johnson, Alexander
3. Lewis, Julian
4. Raff, Martin
5. Roberts, Keith
6. Walter, Peter
Edition: 4
Month: 3
Publisher: Garland Science
Year: 2002
I get the code I have posted above, with two sets of braces around the title.
Edit: Removing the extra pair of braces resolves the warning.
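For reference, if you do fall back to downloading the .bib file, a small post-processing step can collapse the doubled title braces before you upload it. This is only a sketch: references.bib is a placeholder file name, the regex assumes the doubled braces wrap the whole title as in the Mendeley export above, and double braces are sometimes used intentionally in BibTeX to protect capitalization, so only strip them if that is really what you want:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FixDoubleBraces {
    public static void main(String[] args) throws IOException {
        Path bib = Path.of("references.bib");   // placeholder for the downloaded file
        String text = Files.readString(bib);
        // collapse title = {{...}} into title = {...}
        String fixed = text.replaceAll(
                "(?m)^(\\s*title\\s*=\\s*)\\{\\{(.*)\\}\\}(,?)\\s*$",
                "$1{$2}$3");
        Files.writeString(bib, fixed);
    }
}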

BBEdit: how to write a replacement pattern when a back reference is immediately followed by a number

I'm new to GREP in BBEdit. I need to find a string inside an XML file; the string is enclosed in quotes, and I need to replace only what's inside the quotes.
The problem is that the replacement string starts with a number, which confuses BBEdit when I put together the replacement pattern. Example:
Original string in XML looks like this:
What I need to replace it with:
01 new file name.png
My grep search and replace patterns:
Using the replacement pattern above, BBEdit wrongly thinks that the first backreference is "\101", when what I really need it to understand is that I mean "\01".
TIA for any help.
Your example is highly artificial because, in fact, there is no need for your \1 or \3: you already know their value (it is ") and you can just type it directly to get the desired result:
"01 new file name.png"
However, just for the sake of completeness, the answer to your actual question (how to write a replacement group number followed by a number) is that you write this:
\0101 new file name.png\3
The reason that works is that there can only be 99 capture groups, so \0101 is parsed as \01 (the first capture group) followed by literal 01.
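For comparison, java.util.regex resolves the same ambiguity in replacement strings with a similar greedy rule: after $, digits are consumed only while they still name an existing capture group. A small illustration (the input string and the pattern are my own, chosen to mirror the example above):

Matcher m = Pattern.compile("(\")([^\"]*)(\")").matcher("name=\"old file.png\"");
// With only 3 groups, "$101..." is read as group 1 followed by the literal "01",
// because group 10 does not exist; "$3" restores the closing quote.
System.out.println(m.replaceAll("$101 new file name.png$3"));
// -> name="01 new file name.png"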

Add custom localized text in DateTimeFormatter, for a new custom TemporalField

I've created a CENTURY field that implements java.time.temporal.TemporalField. This question is not about the correct implementation details of such a field (which will be handled later); I'm interested in the DateTimeFormatter issue explained below.
Basically, the field gets the ChronoField.YEAR of a temporal object and uses this value to calculate the century (the calculation is done in the getFrom(TemporalAccessor temporal) method, considering that the 1st century goes from year 1 to 100 - but as I said, let's not get stuck on these details).
The most basic usage is:
LocalDate.of(2017, 1, 1).get(CENTURY); // 21
Which returns 21 in this case.
The field can also be used in a DateTimeFormatter:
DateTimeFormatter fmt = new DateTimeFormatterBuilder()
        .appendPattern("dd/MM/yyyy ")
        .appendValue(CENTURY)
        .toFormatter();
System.out.println(fmt.format(LocalDate.of(2017, 1, 1))); // 01/01/2017 21
The output for the above is:
01/01/2017 21
But what I want to do is to use a custom localized text for this field. If I create a formatter like this:
DateTimeFormatter fmt = new DateTimeFormatterBuilder()
        .appendPattern("dd/MM/yyyy ")
        // century text
        .appendText(CENTURY, TextStyle.SHORT)
        // use English locale
        .toFormatter(Locale.ENGLISH);
System.out.println(fmt.format(LocalDate.of(2017, 1, 1))); // 01/01/2017 21
Since there's no localized data for my new CENTURY field, the output is just the field's own value, 21.
I'm trying to find a way to add custom localized strings for this field, the way it's done for month and day of week, for example (let's assume that I already have the resource bundle properties files set up).
Checking the source code, I've found that the formatter internally uses a TextPrinterParser, which in turn uses a DateTimeTextProvider to get the localized strings, but none of those classes are public, so they can't be used or extended. And the API doesn't seem to provide a way to add custom localized strings for new fields.
I could do it only by using reflection and a java.lang.reflect.Proxy to override the behaviour of the TextPrinterParser, but I wonder if there's a better way (one that doesn't require all this "magic").
How can this be done (if possible)?
I know I could also use appendText(TemporalField field, Map<Long,String> textLookup), but that wouldn't be a "locale sensitive" solution (although it seems to be the best workaround available).
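For completeness, here is a minimal sketch of that appendText(field, Map) workaround, reusing the same CENTURY field as the snippets above (plus java.util.Map/HashMap). The lookup table is fixed per formatter, so locale sensitivity has to be simulated by building one formatter per Locale, for example with the texts loaded from your resource bundles:

// English texts, hard-coded here; they could come from a resource bundle
Map<Long, String> centuryTexts = new HashMap<>();
centuryTexts.put(20L, "20th century");
centuryTexts.put(21L, "21st century");

DateTimeFormatter fmt = new DateTimeFormatterBuilder()
        .appendPattern("dd/MM/yyyy ")
        .appendText(CENTURY, centuryTexts)
        .toFormatter(Locale.ENGLISH);

System.out.println(fmt.format(LocalDate.of(2017, 1, 1))); // 01/01/2017 21st century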
This is not possible in java.time.* today. The DateTimeTextProvider class was intended to be extensible, but this got descoped during development. Providing pluggable text providers would be a useful enhancement to Java.

Parsing date and time with the new java.time.DateTimeFormatter

I have a date of this type: 2004-12-31 23:00:00-08, but none of the patterns I know and have used from the documentation is working. I thought it should be something like "yyyy-MM-dd HH:mm:ssX", but it isn't working.
Sorry, but this is a known bug that was already reported in January 2014. According to the bug log, a possible solution has been deferred.
A simple workaround avoiding alternative external libraries is text preprocessing. That means: before you parse the text, you just append the suffix ":00" when the offset lacks a minute part. Example:
String input = "2004-12-31 23:00:00-08";
String zero = ":00";
if (input.charAt(input.length() - 3) == ':') {
    zero = "";
}
ZonedDateTime zdt =
    ZonedDateTime.parse(
        input + zero,
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ssXXX"));
System.out.println(zdt);
// output: 2004-12-31T23:00-08:00
UPDATE due to debate with @Seelenvirtuose:
As long as you ONLY have offsets with just hours and no minute part, the pattern "uuuu-MM-dd HH:mm:ssX" will solve your problem, too (as @Seelenvirtuose has correctly stated in his comment).
But if you have to process a list of various strings with mixed offsets like "-08", "Z" or "+05:30" (the latter is India Standard Time), then you should usually apply the pattern containing three XXX. But this currently fails (I have verified it by testing with the latest version of Java 8). So in this case you still have to do text preprocessing and/or text analysis.
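If you do have to handle such a mixed list, one way to package the preprocessing is a small helper that pads a trailing hour-only offset before parsing. This is just a sketch of the idea above (the method name and the regex are mine), using the same uuuu-MM-dd HH:mm:ssXXX pattern:

static final DateTimeFormatter FMT =
        DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ssXXX");

// pads a trailing hour-only offset such as "-08" to "-08:00" so the XXX
// pattern can parse it; "Z" and full offsets pass through unchanged
static ZonedDateTime parseFlexible(String input) {
    String normalized = input.matches(".*[+-]\\d{2}$") ? input + ":00" : input;
    return ZonedDateTime.parse(normalized, FMT);
}

System.out.println(parseFlexible("2004-12-31 23:00:00-08"));    // 2004-12-31T23:00-08:00
System.out.println(parseFlexible("2004-12-31 23:00:00Z"));      // 2004-12-31T23:00Z
System.out.println(parseFlexible("2004-12-31 23:00:00+05:30")); // 2004-12-31T23:00+05:30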

Correct word-count of a LaTeX document

I'm currently searching for an application or a script that does a correct word count for a LaTeX document.
Up till now, I have only encountered scripts that work on a single file, but what I want is a script that can safely ignore LaTeX keywords and also traverse linked files, i.e. follow \include and \input links, to produce a correct word count for the whole document.
With Vim, I currently use ggVG followed by g CTRL+G, but obviously that shows the count for the current file and does not ignore LaTeX keywords.
Does anyone know of any script (or application) that can do this job?
I use texcount. The webpage has a Perl script to download (and a manual).
It will include tex files that are included (\input or \include) in the document (see -inc), supports macros, and has many other nice features.
When following included files you will get details about each separate file as well as a total. For example, here is the total output for a 12-page document of mine:
TOTAL COUNT
Files: 20
Words in text: 4188
Words in headers: 26
Words in float captions: 404
Number of headers: 12
Number of floats: 7
Number of math inlines: 85
Number of math displayed: 19
If you're only interested in the total, use the -total argument.
I went with icio's comment and did a word count on the PDF itself by piping the output of pdftotext to wc:
pdftotext file.pdf - | wc -w
latex file.tex
dvips -o - file.dvi | ps2ascii | wc -w
should give you a fairly accurate word count.
To add to @aioobe's answer: if you use pdflatex, just do
pdftops file.pdf
ps2ascii file.ps | wc -w
I compared this count to the count in Microsoft Word for a 1599-word document (according to Word). pdftotext produced a text with 1700+ words. texcount did not include the references and produced 1088 words. ps2ascii returned 1603 words, 4 more than in Word.
I'd say that's a pretty good count. I am not sure where the 4-word difference comes from, though. :)
In the Texmaker interface you can get the word count by right-clicking in the PDF preview (the original answer showed a screenshot).
Overleaf has a word count feature; the original answer showed where to find it with screenshots for both Overleaf v2 and Overleaf v1.
I use the following VIM script:
function! WC()
  let filename = expand("%")
  let cmd = "detex " . filename . " | wc -w | perl -pe 'chomp; s/ +//;'"
  let result = system(cmd)
  echo result . " words"
endfunction
… but it doesn’t follow links. This would basically entail parsing the TeX file to get all linked files, wouldn’t it?
The advantage over the other answers is that it doesn’t have to produce an output file (PDF or PS) to compute the word count so it’s potentially (depending on usage) much more efficient.
Although icio’s comment is theoretically correct, I found that the above method gives quite accurate estimates for the number of words. For most texts, it’s well within the 5% margin that is used in many assignments.
If the use of a vim plugin suits you, the vimtex plugin has integrated the texcount tool quite nicely.
Here is an excerpt from their documentation:
:VimtexCountLetters Shows the number of letters/characters or words in
:VimtexCountWords the current project or in the selected region. The
count is created with `texcount` through a call on
the main project file similar to: >
texcount -nosub -sum [-letter] -merge -q -1 FILE
<
Note: Default arguments may be controlled with
|g:vimtex_texcount_custom_arg|.
Note: One may access the information through the
function `vimtex#misc#wordcount(opts)`, where
`opts` is a dictionary with the following
keys (defaults indicated): >
'range' : [1, line('$')]
'count_letters' : 0/1
'detailed' : 0
<
If `detailed` is 0, then it only returns the
total count. This makes it possible to use for
e.g. statusline functions. If the `opts` dict
is not passed, then the defaults are assumed.
*VimtexCountLetters!*
*VimtexCountWords!*
:VimtexCountLetters! Similar to |VimtexCountLetters|/|VimtexCountWords|, but
:VimtexCountWords! show separate reports for included files. I.e.
presents the result of: >
texcount -nosub -sum [-letter] -inc FILE
<
The nice part about this is how extensible it is. On top of counting the number of words in your current file, you can make a visual selection (say two or three paragraphs) and then only apply the command to your selection.
For a very basic article class document I just look at the number of matches for a regex to find words. I use Sublime Text, so this method may not work for you in a different editor, but I just hit Ctrl+F (Command+F on Mac) and then, with regex enabled, search for
(^|\s+|"|((h|f|te){)|\()\w+
which should ignore text declaring a floating environment or captions on figures as well as most kinds of basic equations and \usepackage declarations, while including quotations and parentheticals. It also counts footnotes and \emphasized text and will count \hyperref links as one word. It's not perfect, but it's typically accurate to within a few dozen words or so. You could refine it to work for you, but a script is probably a better solution, since LaTeX source code isn't a regular language. Just thought I'd throw this up here.
