I want to select logo image URLs from an HTML string. I assume that the logo image URL will have the text 'logo' somewhere in its URL.
I need a regex that selects image URLs from a given HTML string. The logo URL will have the text 'logo' in its path.
/(https?:\/\/(?:www\.)?[\w+-_.0-9#\/]+logo.(?:png|jpg|jpeg))/i
["https://static.infragistics.com/marketing/Website/home/espn-logo.png", "https://static.infragistics.com/marketing/Website/home/mondelez-logo.png", "https://static.infragistics.com/marketing/Website/home/nielsen-logo.png", "https://static.infragistics.com/marketing/Website/home/united-logo.png", "https://static.infragistics.com/marketing/Website/home/merrill-lynch-logo.png", "https://static.infragistics.com/marketing/Website/home/dell-logo.png", "https://static.infragistics.com/marketing/Website/home/intel-logo.png", "https://static.infragistics.com/marketing/Website/home/prudential-logo.png", "https://static.infragistics.com/marketing/Website/home/mcdonalds-logo.png"]
The text 'logo' can be present anywhere in the URL.
I need a regex that picks image URLs which have the text 'logo' in them.
It might be a good idea to relax the constraints in the expression, perhaps without even word boundaries for logo, with an expression similar to:
(?i)^(?=.*logo)https?:\/\/\S+(?:png|jpe?g|gif|tiff)$
where,
(?=.*logo)
would simply check if there is a logo anywhere in the URL.
If we just want to check for the word logo in the image names, such as
espn-logo.png
espn-logos.png
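we would move our positive lookahead forward, after the last slash, and restrict it to characters other than a slash so that it cannot match a directory name earlier in the path, for instance:
(?i)^https?:\/\/\S+\/(?=[^\/\s]*logo)[^\/\s]*(?:png|jpe?g|gif|tiff)$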
and our desired image extensions would go in this non-capturing group:
(?:png|jpe?g|gif|tiff|svg)
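To make the difference between the two lookahead positions concrete, here is a minimal Ruby sketch. The sample URLs are hypothetical, and the file-name variant uses [^\/\s]* instead of .* so that the lookahead cannot backtrack into a directory name; that tightening is an assumption of this sketch.

```ruby
# Hypothetical sample URLs: the first one has "logo" only in a directory name.
urls = [
  "https://example.com/logos/banner.png",
  "https://example.com/img/espn-logo.png",
  "https://example.com/img/photo.jpg"
]

# Lookahead at the start: "logo" may appear anywhere in the URL.
anywhere = /\A(?=.*logo)https?:\/\/\S+(?:png|jpe?g|gif|tiff)\z/i

# Lookahead after the last slash: "logo" must appear in the file name itself.
in_name = /\Ahttps?:\/\/\S+\/(?=[^\/\s]*logo)[^\/\s]*(?:png|jpe?g|gif|tiff)\z/i

matches_anywhere = urls.grep(anywhere)
matches_in_name  = urls.grep(in_name)

puts "anywhere: #{matches_anywhere.inspect}"
puts "in name:  #{matches_in_name.inspect}"
```

The first pattern accepts both URLs that contain "logo" at all, while the second accepts only the one whose file name contains it.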
Test
# In Ruby, /s is an encoding flag rather than "dot matches newline", so only /i is used here.
re = /^(?=.*logo)https?:\/\/\S+(?:png|jpe?g|gif|tiff)$/i
str = 'https://static.infragistics.com/marketing/Website/home/espn-logo.png
https://static.infragistics.com/marketing/Website/home/mondelez-logo.gif
https://static.infragistics.com/marketing/Website/home/nielsen-logo.jpg
https://static.infragistics.com/marketing/Website/home/united-logo.jpeg
https://static.infragistics.com/marketing/Website/home/merrill-lynch-logo.PNG
https://static.infragistics.com/marketing/Website/home/dell-logo.TIFF
https://static.infragistics.com/marketing/Website/home/intel-logo.gif
https://static.infragistics.com/marketing/Website/home/prudential-logo.png
https://static.infragistics.com/marketing/Website/home/mcdonalds-logo.GIF
https://static.infragistics.com/marketing/Website/home/mcdonalds-alogo.GIF
https://static.infragistics.com/marketing/Website/home/mcdonalds-logos.GIF'
str.scan(re) do |match|
  puts match.to_s
end
The expression is explained in the top-right panel of regex101.com, if you wish to explore, simplify, or modify it; there you can also watch how it matches against some sample inputs.
Edit
For those cases that we have other instances of URLs, we would usually add more constraints for the edge cases, such as:
(?i)(?<=")\s*(?:https?)?:\/\/[^"]+\/(?=[^"]*logo)[^"]*(?:png|jpe?g|gif|tiff)\s*(?=")
It would probably be easier to first collect the image URLs and then check whether there is a \blogo\b in the image names. Otherwise, the expression might get complicated.
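The two-step approach could look roughly like the following Ruby sketch. The HTML fragment is hypothetical, and the sketch assumes absolute URLs inside double quotes; real pages may need an HTML parser instead.

```ruby
# Hypothetical HTML fragment used only for illustration.
html = <<~HTML
  <img src="https://example.com/img/espn-logo.png">
  <img src="https://example.com/img/team-photo.jpg">
  <a href="https://example.com/logo-history.html">History</a>
  <img src="https://example.com/img/dell-logo.PNG">
HTML

# Step 1: collect every double-quoted absolute URL that ends in an image extension.
image_urls = html.scan(/"(https?:\/\/[^"\s]+\.(?:png|jpe?g|gif|tiff|svg))"/i).flatten

# Step 2: keep the URLs whose file name (last path segment) contains the word "logo".
logo_urls = image_urls.select { |url| url.split("/").last =~ /\blogo\b/i }

puts logo_urls
```

Splitting the work this way keeps each expression short: the first only knows about image extensions, the second only about the file name.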
I feel that similar questions have been asked before, but not about HTML-like tags or for Lua 5.4.
I have a string <NS>my_file_path.py</NS> <NS>count</NS> <NS>type: :model</NS> <TS>do some counting</TS> and ideally I'll be able to pick specific tags (and everything between it) such as <NS>type: :model</NS>, and remove it from the string before doing any further formatting.
I'm guessing some matching with <NS>type: would be a start but how I stop at </NS> is the confusing part!
First of all: Do not attempt to parse HTML (or XML) with RegEx (or Lua patterns). Use libraries instead.
However, if you're only interested in removing innermost tags (i.e. "leaf" tags; tags without children), your tags are strictly formatted in the simple fashion of your example (no <tag spacing or attributes inside="tag" > allowed) and the scope of your project is very limited, you could use string.gsub and a pattern to remove these tags:
str = str:gsub("<NS>type:.-</NS>", "")
Pattern explanation:
find substrings starting with "<NS>type:"
allow for arbitrary content - zero or more arbitrary characters (.); note that this has to be lazy (-) instead of greedy (*) to work
stop matching the substring at the first occurrence of </NS>, closing the tag; if you used a greedy quantifier before, this would have stopped at the last occurrence of </NS>, exceeding the tag
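The same lazy-versus-greedy distinction can be illustrated in Ruby, where the lazy quantifier is written *? rather than Lua's -. This is only a sketch; the trailing <NS>tail</NS> tag is a hypothetical addition to the example string so that the two quantifiers give different results.

```ruby
# The example string from the question, plus a hypothetical extra tag at the end.
str = "<NS>my_file_path.py</NS> <NS>count</NS> <NS>type: :model</NS> " \
      "<TS>do some counting</TS> <NS>tail</NS>"

# Lazy: stops at the first </NS> after "<NS>type:", removing only that tag.
lazy = str.gsub(/<NS>type:.*?<\/NS>/, "")

# Greedy: runs on to the last </NS>, swallowing everything in between.
greedy = str.gsub(/<NS>type:.*<\/NS>/, "")

puts lazy
puts greedy
```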
I am trying to get rid of shortcodes inside a Google Sheets column. I have many items such as [spacer type="1" height="20"][spacer] or [FinalTilesGallery id="37"] and I would just like to remove them. Is there any simple way to do it?
Thanks !
For in-place replacement, the quick option would be to use the Find and Replace dialog (Ctrl + H) with Search Using Regular Expressions turned on, which is more powerful than your standard Find and Replace.
Find: \[.*?\] - Match anything within an open-bracket up to the very next close-bracket. This should work assuming you have no nested brackets, e.g. [[no][no]].
If you do have nested brackets, you'll have to change this to \[[^\[\]]*\]. And continue to Replace All until all the codes are gone.
Replace: Nothing.
Replace All. If you don't want to affect other sheets that may be in your document, make sure you select the right range to work with, too.
This just erases everything within the brackets.
If you want to erase any redundant spaces left by this, simply Find and Replace again (with Regular Expressions) on " +" (a space followed by a plus sign), which matches one or more spaces, and replace with a single space.
E.g.:
"string [] [] string2" becomes "string", three spaces, "string2" after the shortcode replacement.
After replacing spaces, it becomes "string string2".
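Outside of the Find and Replace dialog, the "Replace All until all the codes are gone" step for nested brackets can be sketched in Ruby as a loop. The sample text is hypothetical and combines the shortcodes from the question with the nested case mentioned above.

```ruby
# Hypothetical cell content: two shortcodes plus a nested bracket group.
text = 'intro [spacer type="1" height="20"][spacer] body [[no][no]] end'

cleaned = text.dup
# Remove innermost bracket pairs repeatedly until a pass changes nothing.
loop do
  stripped = cleaned.gsub(/\[[^\[\]]*\]/, "")
  break if stripped == cleaned
  cleaned = stripped
end

# Collapse the runs of spaces that the removals leave behind.
cleaned = cleaned.gsub(/ +/, " ")

puts cleaned
```

Each pass removes only brackets that contain no other brackets, so the outer pair of a nested group disappears on the following pass.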
Let's say your original strings are in the range A2:A. Place the following into B2 of an otherwise completely empty Column B (or the second cell of any other empty column):
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(A2:A,"\[[^\[\]]+\]",""))))
I can't see your data, so I don't know what kind of information is between these shortcodes. If you find that this leaves you with concatenated pieces of data where there should be spaces between them, replace the above with this version:
=ArrayFormula(IF(A2:A="",,TRIM(REGEXREPLACE(SUBSTITUTE(SUBSTITUTE(A2:A,"["," ["),"]","] "),"\[[^\[\]]+\]",""))))
I can't teach regular expression language here. But I will note that, since square brackets have specific meaning within regex, your literal square brackets must be indicated with the escape character: the backslash.
Here is the regex expression alone:
\[[^\[\]]+\]
The opening \[ and the closing \], then, reference your actual opening and closing bracket sets. If we remove those, we have this left:
[^\[\]]+
Again, you see the escaped opening and closing square brackets, which I'll replace with the word these:
[^these]+
What remains there are opening and closing brackets with regex meaning, i.e., "anything in this group." And the circumflex symbol ^ as the first character within this set of square brackets means "anything except." The + symbol means "in any string length of one or more characters."
So that whole regex expression then reads: "A literal open square bracket, followed by one or more characters that are anything except square brackets, ending with a literal closing square bracket."
And we are REGEXREPLACE-ing any instance of that with "" (i.e., nothing).
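For readers who want to try the expression outside of Sheets, the breakdown above can be written as a commented Ruby regex. This is a sketch only: the sample cell value is hypothetical, and squeeze plus strip is used as a rough stand-in for what TRIM does in Sheets.

```ruby
# The same expression in extended (/x) form, one piece per line.
SHORTCODE = /
  \[          # a literal opening square bracket
  [^\[\]]+    # one or more characters that are not square brackets
  \]          # a literal closing square bracket
/x

# Hypothetical cell value with a shortcode wedged between two words.
cell = 'Gallery[FinalTilesGallery id="37"]below'

# First formula: remove the shortcode, then trim.
plain = cell.gsub(SHORTCODE, "").squeeze(" ").strip

# Second formula: pad the brackets with spaces first so neighbours stay apart.
padded = cell.gsub("[", " [").gsub("]", "] ").gsub(SHORTCODE, "").squeeze(" ").strip

puts plain
puts padded
```

The first result runs the two words together, which is the situation the second formula is meant to avoid.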
I need to update a bilingual dictionary written in Writer by first parsing all entries into their parts e.g.
main word (font 1, bold)
foreign equivalent transliterated (font 1, italic)
foreign equivalent (font 2, bold)
part of speech (font 1, italic)
Each line of the document is the main word followed by the parts listed above, each separated by a space or punctuation.
I need to automate the process of walking through the whole file, line by line, and place a delimiter between each part, ignoring spaces and punctuation, so I can mass import it into a Calc file. In other words, "each part" is a sequence of character (ignoring spaces and punctuation) that have the same font AND font-style.
I have tried the standard Search&Replace feature, and AltSearch extension, but neither are able to complete the task. The main problem is I am not able to write a search query that says:
Find: consecutive characters with the same font AND font_style, ignore spaces and punctuation
Replace: term found above + "delimiter"
Any suggestions how I can write a script for this, or if an existing tool can solve the problem?
Thanks!
Pseudo code for desired effect:
var $delimiter = "|"
Go to beginning of document
While not end of document do:
    var $currLine = get line from doc
    var $currChar = get next character which is not space or punctuation
    var $font = $currChar.font
    var $font_style = $currChar.font_style  (e.g. bold, italic, normal)
    While not end of line do:
        $currChar = next character which is not space or punctuation
        if ($currChar.font != $font || $currChar.font_style != $font_style) {  // font or style has changed
            print $delimiter
            $font = $currChar.font
            $font_style = $currChar.font_style
        }
    end While
end While
Here are tips for each of the things your pseudocode does.
First, the easiest way to move line by line is with the TextViewCursor, although it is slow. Notice the XLineCursor section. For the while loop, oVC.goDown() will return false when the end of the document is reached. (oVC is our variable for the TextViewCursor).
Get each character by calling oVC.goRight(0, False) to deselect followed by oVC.goRight(1, True) to select. Then the selected value is obtained by oVC.getString(). To ignore space and punctuation, perhaps use python's isalnum() or the re module.
To determine the font of the character, call oVC.getPropertyValue(attr). Values for attr could simply be CharAutoStyleName and CharStyleName to check for any changes in formatting.
Or grab a list of specific properties such as 'CharFontFamily', 'CharFontFamilyAsian', 'CharFontFamilyComplex', 'CharFontPitch', 'CharFontPitchAsian' etc. Character properties are described at https://wiki.openoffice.org/wiki/Documentation/DevGuide/Text/Formatting.
To insert the delimiter into the text: oVC.getText().insertString(oVC, "|", 0).
This python code from github shows how to do most of these things, although you'll need to read through it to find the relevant parts.
Alternatively, instead of using the LibreOffice API, unzip the .odt file and parse content.xml with a script.
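Independently of how the characters and their formatting are read (TextViewCursor or content.xml), the delimiter logic itself can be sketched on plain data. The Ruby sketch below does not use the LibreOffice API; Char, run, and delimit are hypothetical names, and the sample line with its fonts is made up for illustration.

```ruby
# Hypothetical stand-in for one document character with its formatting.
Char = Struct.new(:text, :font, :style)

# Build a run of characters that share one font and style (illustrative helper).
def run(text, font, style)
  text.chars.map { |ch| Char.new(ch, font, style) }
end

# Insert the delimiter before the first letter or digit whose font or style
# differs from the previous letter or digit; spaces and punctuation are ignored
# when comparing formats but are still copied to the output.
def delimit(chars, delimiter = "|")
  out = +""
  current = nil
  chars.each do |c|
    if c.text.match?(/[[:alnum:]]/)
      format = [c.font, c.style]
      out << delimiter if current && format != current
      current = format
    end
    out << c.text
  end
  out
end

line = run("house", "Font1", :bold) + run(" ", "Font1", :normal) +
       run("bayt", "Font1", :italic) + run(" ", "Font1", :normal) +
       run("xyz", "Font2", :bold) + run(", ", "Font1", :normal) +
       run("n.", "Font1", :italic)

puts delimit(line)
```

Comparing only letters and digits means a differently formatted space or comma between two parts never produces an extra delimiter.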
In a Wikipedia article's text, a link might be written like [Category:A B C]; however, the actual wiki URL will have a suffix like Category:A_B_C.
Where can I get information about all the rules the wiki uses to derive the URL from a link in its text (e.g. converting spaces to underscores, capitalizing the first letter, dealing with non-ASCII characters, etc.)?
Roughly the following:
Normalize namespace, e.g. category: --> Category:.
Uppercase the first letter of title proper, e.g. Category:foo --> Category:Foo. Note: this depends on wiki settings and titles are never uppercased on Wiktionary, for example.
Replace spaces with underscores, e.g. Foo bar --> Foo_bar.
Percent-encode all the usual characters with PHP's standard function urlencode(), except for the following ones: ;:#$!*(),/.
For full technical details you could look up the MediaWiki functions getLocalUrl() and wfUrlencode().
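The four steps above can be approximated in a short Ruby sketch. This is not MediaWiki's code: wiki_url_path, upfirst, and percent_encode are hypothetical helpers, the list of characters left unescaped is copied from the step above, and treating everything before the first colon as a namespace is a simplification.

```ruby
# Characters left unescaped: the list from the answer plus "-", "_" and ".".
# Single quotes keep Ruby from interpolating "#$!".
UNESCAPED = ';:#$!*(),/-_.'

# Percent-encode every byte of every character that is not alphanumeric
# and not in the unescaped list.
def percent_encode(text)
  text.each_char.map do |ch|
    if ch.match?(/\A[A-Za-z0-9]\z/) || UNESCAPED.include?(ch)
      ch
    else
      ch.bytes.map { |b| format("%%%02X", b) }.join
    end
  end.join
end

# Uppercase only the first character, leaving the rest untouched.
def upfirst(s)
  s.empty? ? s : s[0].upcase + s[1..-1]
end

# capitalize_first: false imitates wikis that never uppercase titles.
def wiki_url_path(link, capitalize_first: true)
  namespace, title = link.include?(":") ? link.split(":", 2) : [nil, link]
  title = title.strip
  title = upfirst(title) if capitalize_first
  full = namespace ? "#{upfirst(namespace.strip)}:#{title}" : title
  percent_encode(full.tr(" ", "_"))
end

puts wiki_url_path("category:A B C")
puts wiki_url_path("foo bar")
puts wiki_url_path("caf\u00e9 au lait")
```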
There is no “etc.”, you already mentioned all the rules:
spaces are converted to underscores
the first letter of the article title is capitalized (the first letter of the namespace is capitalized too, if there is any)
the whole link is percent-encoded
Note that rules #1 and #2 are not mandatory: if you create your own URL that doesn't follow them, Wikipedia will still show the page correctly.
Things get more complicated if you include namespace aliases (WP:WikiProject Computing → Wikipedia:WikiProject_Computing) and interwiki links (wikia:gameofthrones:Westeros → http://www.wikia.com/wiki/c:gameofthrones:Westeros).
I'm using ruby and I'm trying to find a way to grab text in between the {start_grab_entries} and {end_grab_entries} like so:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
Something like so:
$1 => "i want to grab
the text that
you see here in
the middle"
So far, I tried this as my regular expression:
\{start_grab_entries}(.|\n)*\{end_grab_entries}
However, using $1, that gives me a blank. Do you know what I can do to grab that block of text in between the tags correctly?
There is a better way to allow the dot to match newlines (/m modifier):
regexp = /\{start_grab_entries\}(.*?)\{end_grab_entries\}/m
Also, make the * lazy by appending a ?, or you might match too much if more than one such section occurs in your input.
That said, the reason why you got a blank match is that you repeated the capturing group itself; therefore you only caught the last repetition (in this case, a \n).
It would have "worked" if you had put the capturing group outside of the repetition:
\{start_grab_entries\}((?:.|\n)*)\{end_grab_entries\}
but, as said above, there is a better way to do that.
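The three variants discussed in this answer can be compared side by side. The following Ruby sketch uses a shortened, hypothetical version of the input and reads capture group 1 from each pattern.

```ruby
# Shortened sample input with the two boundary markers.
text = "{start_grab_entries}\ni want to grab\nthe text\n{end_grab_entries}\n"

# Repeating the capturing group itself: only the last repetition is kept.
repeated_group = text[/\{start_grab_entries\}(.|\n)*\{end_grab_entries\}/, 1]

# Capturing group placed around the repetition: the whole block is kept.
outer_group = text[/\{start_grab_entries\}((?:.|\n)*)\{end_grab_entries\}/, 1]

# /m lets the dot match newlines; the lazy *? stops at the first end marker.
multiline = text[/\{start_grab_entries\}(.*?)\{end_grab_entries\}/m, 1]

p repeated_group
p outer_group
p multiline
```

The first result is just the newline that precedes the end marker, which is why $1 looked blank in the question.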
I'm adding this because often we're reading data from a file or data-stream where the range of lines we want are not all in memory at once. "Slurping" a file is discouraged if the data could exceed the available memory, something that easily happens in production corporate environments. This is how we'd grab lines between some boundary markers as the file is being scanned. It doesn't rely on regex, instead using Ruby's "flip-flop" .. operator:
#!/usr/bin/ruby
lines = []
DATA.each_line do |line|
lines << line if (line['{start_grab_entries}'] .. line['{end_grab_entries}'])
end
puts lines # << lines with boundary markers
puts
puts lines[1 .. -2] # << lines without boundary markers
__END__
this is not captured
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
this is not captured either
Output of this code would look like:
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
i want to grab
the text that
you see here in
the middle
string=<<EOF
blah
{start_grab_entries}
i want to grab
the text that
you see here in
the middle
{end_grab_entries}
blah
EOF
puts string.scan(/\{start_grab_entries\}(.*?)\{end_grab_entries\}/m)