Regular Expression separate groups before a line of special characters - ios

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-
This PM was sent by [ helloworld ] hellworld#gmail.com,
Membership Status : YES
http://gg.com.zz/US?id=gg#1
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^-
Title : Testing is testing
Quantity : 44
Price : 55.00
Item Location : United States
*******************************************************************
I want this message right here, hello there, you help is deeply
**
appreciated :)
*** This email was sent using gg.gg.com ***
The above is my output string. I want to get the groups between the long ^^^^^^- and ****** dividers.
The end result would be:
This PM was sent by [ helloworld ] hellworld#gmail.com,
Membership Status : YES
http://gg.com.zz/US?id=gg#1
Title : Testing is testing
Quantity : 44
Price : 55.00
Item Location : United States
I want this message right here, hello there, you help is deeply
**^
appreciated :)
I tried (?<=^)[^\^]*|[^\^-]*(?<=\*\*) but it couldn't match the whole long ^^^^^^^ divider. Can anybody help me with this?

You can use this regex to capture the intended data:
(?s)^(?:\^+-|\*{3,})\s*(.+?)(?=\s*(?:\^+|\*{3,}))
Explanation:
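(?s) - Enables . to match newline characters, which is required here because the data to be captured spans multiple lines
^ - Matches the start of a line (this assumes multiline mode is enabled, e.g. the m flag; without it ^ matches only at the start of the text, so only the first block is found)
(?:\^+-|\*{3,})\s* - Matches one or more ^ characters ending with -, or three or more * characters, followed by optional whitespace. Three is the minimum so that the line containing only ** is not treated as a divider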
(.+?) - Matches the intended text and captures it in first grouping pattern
(?=\s*(?:\^+|\*{3,})) - Lookahead that stops the capture at optional whitespace followed by a divider pattern such as ^^^^^- or *****
My previous answer also worked, but this one is better because it captures the data neatly.
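As a rough check, the pattern can be exercised in Ruby. This is only a sketch under a few assumptions: Ruby has no (?s) flag, so the /m option stands in for it (it makes . match newlines), and in Ruby ^ already matches at the start of every line. The shortened divider lines and the names DIVIDER_BLOCK, sample and groups are made up for the example.

```ruby
# Sketch only: the same pattern written for Ruby's regex engine.
# /m makes "." match newlines (the role (?s) plays in the original);
# "^" matches at every line start in Ruby without any extra flag.
DIVIDER_BLOCK = /^(?:\^+-|\*{3,})\s*(.+?)(?=\s*(?:\^+|\*{3,}))/m

# Copy of the message from the question, with the divider lines shortened.
sample = <<~'TEXT'
  ^^^^^^^^^^^^^^^^^^^^-
  This PM was sent by [ helloworld ] hellworld#gmail.com,
  Membership Status : YES
  http://gg.com.zz/US?id=gg#1
  ^^^^^^^^^^^^^^^^^^^^-
  Title : Testing is testing
  Quantity : 44
  Price : 55.00
  Item Location : United States
  ********************
  I want this message right here, hello there, you help is deeply
  **
  appreciated :)
  *** This email was sent using gg.gg.com ***
TEXT

# scan returns one array per match; flatten leaves just the captured blocks.
groups = sample.scan(DIVIDER_BLOCK).flatten

# The footer line also starts with *** and is followed by ***, so it is
# captured too; only the first three blocks are printed here.
groups.first(3).each_with_index do |block, index|
  puts "Group #{index + 1}:", block, ""
end
```

If the footer should never be captured, one option is to discard any capture after the third, as above, or to strip the footer line before scanning.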

Related

Handling arbitrary text blocks in an Xtext grammar

In an effort to better understand Xtext, I'm working on writing a grammar and have hit a roadblock. I've boiled it down to the following scenario. I have some input such as this:
thing {abc}
{def}
There may be keywords (e.g. 'thing') followed by other language elements (e.g. ID) in braces. Or there can just be a block of content inside braces. This content should simply be passed along to the parser en masse.
If I try something like this:
Model: (things+=AThing | blocks+=ABlock)*;
AThing : 'thing' '{' name = ID '}';
ABlock : block=BLOCK;
terminal BLOCK:'{' -> '}';
and parse the sample text above, I get an error:
'mismatched input '{abc}' expecting '{'' on ABlock, offset 6, length 5
So, '{abc}' is being matched by the BLOCK terminal rule, which I understand. But how do I alter the grammar to properly handle the sample input? I've been wrestling with this problem for a while and have come up empty. So it's either something very simple that I've missed, or the problem is really complex and I don't realize it. Any enlightenment would be greatly appreciated.
Parsing happens in two stages: lexing (tokenizing) and parsing. In the first stage the text input is divided into tokens; in the second, the tokens are matched against the parser rules. Broadly, something like this (with some arbitrary language):
1st phase:
text:   class  X   {   this  ;   }
        -----  --  --  ----  --  --
tokens: ID     ID  LB  ID    SC  RB
2nd phase:
Is there a rule that starts with a 'class' string?
    YES: Is the next expected token an ID?
        YES: Is the next expected token a LB?
            ...
        NO: Is there another rule that starts with 'class'?
            ...
    NO: Is there a rule that starts with an ID token?
        ...
The parser implementation is a bit more complex, but I hope you get the idea.
The issue with your grammar is that your terminal BLOCK rule is used during the first phase, hence you get
thing  {abc}  {def}
-----  -----  -----
ID     BLOCK  BLOCK
That is why the error message says it found '{abc}' and not a '{'. The parser matched the 'thing' keyword and was expecting the next token to be a '{', but it got a BLOCK.
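The tokenizing phase can be imitated with a few lines of Ruby. This is a toy model only, not how Xtext or its generated lexer is implemented: the TERMINALS table, the tokenize method and the ID and WS patterns are all assumptions for the illustration. It shows that once a BLOCK-like terminal exists, '{abc}' is consumed as a single token before any parser rule gets a chance to see a bare '{'.

```ruby
require "strscan"

# Toy terminal rules, tried in order at each position. Only an illustration of
# the tokenizing phase; this is not how Xtext itself is implemented.
TERMINALS = [
  [:BLOCK, /\{.*?\}/m],      # like '{' -> '}': everything up to the next '}'
  [:ID,    /[A-Za-z_]\w*/],
  [:WS,    /\s+/]
].freeze

def tokenize(input)
  scanner = StringScanner.new(input)
  tokens = []
  until scanner.eos?
    matched = TERMINALS.any? do |name, pattern|
      text = scanner.scan(pattern)                # matches only at the current position
      next false unless text
      tokens << [name, text] unless name == :WS   # whitespace is dropped
      true
    end
    raise ArgumentError, "no terminal matches at offset #{scanner.pos}" unless matched
  end
  tokens
end

tokens = tokenize("thing {abc}\n{def}")
tokens.each { |name, text| puts "#{name}: #{text}" }
```

Both brace groups come out as BLOCK tokens, so a rule that expects the literal '{' after 'thing' never sees one.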
If you want arbitrary text inside the block, I don't think you can use '{' to identify the name of things.
This looks like what is mentioned here:
A quite common case requiring backtracking is when your language uses the same delimiter pair for two different semantics
So the simplest solution seems to be to use different delimiters. Otherwise you may have to look into enabling backtracking.

Alter regex markup to not separate float numbers (like 2.0)

I was looking for a solution to a regex problem I had in Rails, and an answer to a separate question led me 90% of the way there. Basically, I would like a Ruby/Rails script that formats messy text by capitalizing the letter that follows any of "./,/!/?". This code is by "Mark S":
ng = Nokogiri::HTML.fragment("<p>hello, how are you? oh, that's nice! i am glad you are fine. i am too.<br />i am glad to have met you.</p>")
ng.traverse{|n| (n.content = n.content.gsub(/(.*?)([\.|\!|\?])/) { " #{$1.strip.capitalize}#{$2}" }.strip) if n.text?}
ng.to_s
The only issue I have with this code, and it is a big issue, is that the code adds a space in between float numbers like "2.0", making a text like:
there is a cat in the hat.it has a 2.0 inch tail!
isn't that awesome?!I think so.
become
There is a cat in the hat. It has a 2. 0 inch tail!
Isn't that awesome?! I think so.
where I obviously want it to be:
There is a cat in the hat. It has a 2.0 inch tail!
Isn't that awesome?! I think so.
Any suggestions on how to alter this text, for example so that any "." will be ignored by this code?
It seems you want to capitalize any lowercase letter at the beginning of the string or after ., !, or ?.
Use
s.gsub(/(\A|[.?!])(\p{Ll})/) { Regexp.last_match(1).length > 0 ? "#{$1} #{$2.capitalize}" : "#{$2.capitalize}" }
See the Ruby demo
Pattern details:
(\A|[.?!]) - Group 1 capturing the start of string location (empty string) or a ., ?, or !
(\p{Ll}) - Group 2 capturing any Unicode lowercase letter
Inside the replacement, we check whether the Group 1 value is empty. If it is (the match is at the start of the string), we return just the capitalized letter. Otherwise, we return the punctuation, a space, and the capitalized letter.
NOTE: However, there is a problem with abbreviations (as usual in these cases), like i.e., e.g., etc. Then there are words like iPhone, iCloud, eSklep, and so on.
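A small sketch of how the substitution behaves, wrapping the gsub from this answer in a helper. The method name capitalize_sentences and the second sample string are made up for the example.

```ruby
# Wraps the gsub shown in the answer; the helper name is hypothetical.
def capitalize_sentences(text)
  text.gsub(/(\A|[.?!])(\p{Ll})/) do
    punctuation = Regexp.last_match(1)           # "" at the start of the string
    letter = Regexp.last_match(2).capitalize
    punctuation.empty? ? letter : "#{punctuation} #{letter}"
  end
end

# "2.0" is left alone because its "." is followed by a digit, not a letter.
puts capitalize_sentences("there is a cat in the hat.it has a 2.0 inch tail!")

# Abbreviations are split as well, which is the caveat mentioned in the note.
puts capitalize_sentences("see e.g.this one")
```

Note that the pattern only fires when the punctuation is immediately followed by a lowercase letter, so text that already has a space after the period is not touched.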

Extracting text from APA citation

I have a spreadsheet containing APA citation style text and I want to split them into author(s), date, and title.
An example of a citation would be:
Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.
Given this string is in field I2 I managed to do the following:
Name: =LEFT(I2, FIND("(", I2)-1) yields Parikka, J.
Date: =MID(I2,FIND("(",I2)+1,FIND(")",I2)-FIND("(",I2)-1) yields 2010
However, I am stuck at extracting the name of the title Insect Media: An Archaeology of Animals and Technology.
My current formula =MID(I2,FIND(").",I2)+2,FIND(").",I2)-FIND(".",I2)) only returns part of the title. The output should contain every character between "). " and the following ".".
I tried =REGEXEXTRACT(I2, "\)\.\s(.*[^\.])\.\s" ) and this generally works but does not stop at the first ". " - Like with this example:
Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476
Where is the mistake?
The title can be found (in the two examples you've given, at least) with this:
=MID(I2,find("). ",I2)+3,find(". ",I2,find("). ",I2)+3)-(find("). ",I2)+3)+1)
In English: get the substring that starts after the first occurrence of "). " and runs up to and including the next "." that is followed by a space.
If you wish to use REGEXEXTRACT, then this works (on your two examples). (You can also see a Regex101 demo.):
=REGEXEXTRACT(I3,"(?:.*\(\d{4}\)\.\s)([^.]*\.)(?: .*)")
Where is the mistake?
In your expression, you were capturing (.*[^\.]), which greedily matches any number of characters followed by one character that is not a dot, so multiple sentences can be captured. The expression ended with \.\s outside the capture group, so the captured text would end just before a period-then-space rather than including the period.
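To see why [^.]* stops where the greedy .* does not, here is the same idea as a Ruby sketch. It is a translation for illustration, not a Sheets formula; the CITATION pattern, the named groups and the split_citation helper are assumptions. Like the formulas above, it would cut short a title that itself contains a period.

```ruby
# [^.]* cannot cross a period, so the title capture ends at the first "."
# after "(year). ". A title that contains a period would be truncated.
CITATION = /\A(?<authors>.*?)\s*\((?<year>\d{4})\)\.\s(?<title>[^.]*\.)/

def split_citation(citation)
  match = CITATION.match(citation)
  return nil unless match

  { authors: match[:authors], year: match[:year], title: match[:title] }
end

first = split_citation("Parikka, J. (2010). Insect Media: An Archaeology of Animals and Technology. Minneapolis: Univ Of Minnesota Press.")
second = split_citation("Sanders, E. B.-N., Brandt, E., & Binder, T. (2010). A framework for organizing the tools and techniques of participatory design. In Proceedings of the 11th biennial participatory design conference (pp. 195–198). ACM. Retrieved from http://dl.acm.org/citation.cfm?id=1900476")

puts first[:title]
puts second[:title]
```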
Try:
=split(SUBSTITUTE(SUBSTITUTE(I2, "(",""), ")", ""),".")
If you don't replace the parentheses around 2010, it thinks it is a negative number -2010.
For your Title try adding index split to your existing formula:
=index(split(REGEXEXTRACT(A5, "\)\.\s(.*[^\.])\.\s" ),"."),0,1)&"."

How to find 2 Assignment operator in regex

I am trying to find 2 symbols together, such as "+*" or "-/", and I also want to identify cases like "3-", "4-", "*4" and so on. I will be looking for them inside an array of strings such as ["2" , "+", "3","/" , "2"]
If I understand your question correctly, you are trying to match a symbol followed by a number, or a number followed by a symbol.
The regex would look something like this:
/^[+\-\/*]\d$|^\d[+\-\/*]$/
Breakdown
^ - Start of line
[+\-\/*] - Any one of the symbols. The hyphen is escaped so that it is not read as a range (an unescaped +-\/ would also match , and .), and the forward slash is escaped because it delimits the regex literal
\d - Matches any digit (0 through 9)
$ - End of line
| - Or
^\d[+\-\/*]$ - Starts with a digit and ends with a symbol
Please let me know if this is what you are looking for. Otherwise I can fix this.
In Ruby, let's pretend you have an array as follows
array = ["2" , "+", "3","/" , "2"]
You can find if any two consecutive elements match the above pattern as follows
array.each_cons(2).to_a.any? { |combo| combo.join.match(/^[+\-\/*]\d$|^\d[+\-\/*]$/) }
Breakdown
Use the each_cons(2) method to get every pair of consecutive elements in the array
Use the any? method to check whether any of those pairs satisfies a condition
For each pair, join the two elements together and test the result against the regex pattern
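One thing to watch in character classes like this: a hyphen placed between two characters forms a range, so a class written as [+-\/] also accepts "," and ".". The sketch below shows the difference and then uses each_cons(2) for the other half of the question, finding two operators in a row. The constant names and the doubled_operators helper are made up for the example.

```ruby
LOOSE_CLASS = /\A[+-\/]\z/   # "+-\/" is read as the range "+" through "/"
OPERATOR    = /\A[-+*\/]\z/  # hyphen first, so it is a literal "-"

puts ",".match?(LOOSE_CLASS)  # true: "," sits between "+" and "/" in ASCII
puts ",".match?(OPERATOR)     # false

# Returns every pair of neighbouring elements that are both operators.
def doubled_operators(tokens)
  tokens.each_cons(2).select { |left, right| left.match?(OPERATOR) && right.match?(OPERATOR) }
end

p doubled_operators(["2", "+", "3", "/", "2"])  # no doubled operators here
p doubled_operators(["2", "+", "*", "3"])       # finds the "+", "*" pair
```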
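I don't get the second part about "3-" etc., but the basic idea for the rest is:
your_array.each do |element|
  # MatchData when the element contains two operator characters in a row, otherwise nil
  result = element.match(/[+*\/-]{2}/)
end
Note that the following characters have to be escaped with a backslash when they are meant literally in a Ruby regex (outside a character class):
. | ( ) [ ] { } + \ ^ $ * ?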

Checking whether a string contains a phone number

Trying to work out how to parse out phone numbers that are left in a string.
e.g.
"Hi Han, this is Chewie, Could you give me a call on 02031234567"
"Hi Han, this is Chewie, Could you give me a call on +442031234567"
"Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567"
"Hi Han, this is Chewie, Could you give me a call on 0207-123-4567"
"Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
And be able to consistently replace any one of them with some other item (e.g. some text, or a link).
Am assuming it's a regex type approach (I'm already doing something similar with email which works well).
I've got to
text.scan(/([^A-Z|^"]{6,})/i)
Which leaves me a leading space I can't work out how to drop (would appreciate the help there).
Is there a standard way of doing this that people use?
It also drops things into arrays, which isn't particularly helpful
i.e. if there were multiple numbers.
[["02031234567"], ["+44207-1234567"]]
as opposed to
["02031234567","+44207-1234567"]
Adding in the third use case, the one with spaces, is difficult. I think the only way to meet that acceptance criterion would be to chain a #gsub call in front of your #scan.
Thus:
text.gsub(/\s+/, "").scan(/([^A-Z|^"|^\s]{6,})/i)
The following code will extract all the numbers for you:
text.scan(/(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/)
For the examples you supplied this results in:
["02031234567"]
["+442031234567"]
["+44 (0) 203 123 4567"]
["0207-123-4567"]
["02031234567", "+44207-1234567"]
To understand this regex, what we are matching is:
[\d \-+()]+ which is a sequence of one or more digits, spaces, minus, plus, opening or closing brackets (in any order - NB regex is greedy by default, so it will match as many of these characters next to each other as possible)
that must be preceded by a space (?<=[ ]) - NB the space in the positive look-behind is not captured, and therefore this makes sure that there are no leading spaces in the results
and is either at the end of the string $, or | is followed by a space then a word character (?=[ ]\w) (NB this lookahead is not captured)
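The pattern can be tried against the sample messages as below, and the same regex can be handed to gsub to cover the "replace with some other item" part of the question. This is a sketch: the PHONE constant, the mask_numbers helper and the "[number]" placeholder are assumptions for the example.

```ruby
PHONE = /(?<=[ ])[\d \-+()]+$|(?<=[ ])[\d \-+()]+(?=[ ]\w)/

messages = [
  "Hi Han, this is Chewie, Could you give me a call on 02031234567",
  "Hi Han, this is Chewie, Could you give me a call on +442031234567",
  "Hi Han, this is Chewie, Could you give me a call on +44 (0) 203 123 4567",
  "Hi Han, this is Chewie, Could you give me a call on 0207-123-4567",
  "Hi Han, this is Chewie, Could you give me a call on 02031234567 OR +44207-1234567"
]

# The pattern has no capture groups, so scan returns a flat array of strings.
messages.each { |message| p message.scan(PHONE) }

# Replaces every matched number with a placeholder (could equally be a link).
def mask_numbers(text, replacement = "[number]")
  text.gsub(PHONE, replacement)
end

puts mask_numbers(messages.last)
```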
This pattern will get rid of the space but not match your third case with spaces:
/([^A-Z|^"|^\s]{6,})/i
This is what I came to in the end in case it helps somebody
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
That gives me an array of
["+442031234567", "02031234567"]
I'm sure there is a more elegant way of doing this and possibly you'd want to check the numbers for likelihood of being phonelike - e.g. using the brilliant Phony gem.
numbers = text.scan(/([^A-Z|^"]{6,})/i).collect{|x| x[0].strip }
real_numbers = numbers.keep_if{|n| Phony.plausible? PhonyRails.normalize_number(n, default_country_code: "GB")}
Which should help exclude serial numbers or the like from being identified as numbers. You'll obviously want to change the country code to something relevant for you.
