Searching for files with a specific pattern

I'm trying to find specific files in a directory using a pattern.
Let's say I have the id of the user: 101.
Here are my files:
101
101_2
101_5
10111
103
10125
101_6
I'm trying to form a regex pattern that matches only the files 101, 101_2, 101_5 and 101_6.
I'm trying the pattern below:
^101_?\d+$
But it doesn't seem to pick up any of the files at all. If I remove the ^, only 101_6 matches for some reason.
EDIT:
I'm using Rails/Ruby to look for files in a particular directory, so something like:
Dir.glob(location).grep("^101_?\d+$")
do something
end

If location isn't the current folder, the paths returned by glob contain both the dirname and the basename, so match the pattern against the basename only:
Dir.glob('./*').select { |f| File.basename(f) =~ /\A101(_\d+)?\z/ }.each do |f|
  puts f
  # do something with f
end
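As a minimal sketch of the difference between the two calls, the snippet below runs both against the file names from the question. The helper name files_for_user and the uploads/ directory prefix are made up for illustration. It also shows why the original grep call returns nothing: Enumerable#grep compares with ===, and for a String pattern that is plain equality, not a regex match.

```ruby
# Hypothetical helper (not from the original answer): keep only paths whose
# basename is the user id, optionally followed by "_<digits>".
def files_for_user(paths, user_id)
  pattern = /\A#{Regexp.escape(user_id.to_s)}(_\d+)?\z/
  paths.select { |path| File.basename(path).match?(pattern) }
end

# File names from the question, placed in an assumed "uploads" directory.
names = %w[101 101_2 101_5 10111 103 10125 101_6]
paths = names.map { |name| "uploads/#{name}" }

# grep with a String pattern tests equality, so no path is selected.
string_grep = paths.grep("^101_?\d+$")
p string_grep                    # => []

matching = files_for_user(paths, 101)
p matching                       # => ["uploads/101", "uploads/101_2", "uploads/101_5", "uploads/101_6"]
```

Passing a Regexp literal to grep would perform a real match, but the pattern would still have to account for the directory part of each path, which is why the basename is extracted first.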

Your question isn't particularly clear, but I'm guessing you want to match anything which is 101 followed by an optional underscore and a digit. If so, use the regex ^101_?\d$. If you want 101 followed by either a digit or an underscore and one or more digits, use ^101(_\d+|\d)$
EDIT
As the OP has mentioned in a comment, 101 should also be matched. The updated regex is ^101(?:_?\d)?$

Related

What is "map" matching exactly in Rack?

Say I have this in Rack:
map '/great' do
run Test.new
end
This URL works great: site.com/great/stuff, but this one does NOT: site.com/greatstuff. I've read that map should match anything that STARTS WITH the argument, but that doesn't seem to be the case for URLs like these.
Is there any detailed specification on how this works?

The confusion seems to be conceptual.
It does match paths starting with /great. That is /great, /great/, /great/stuff and so on.
What it doesn't do is match strings starting with /great. Like /greatstuff.
/greatstuff and /great are completely different paths. Think of paths as a tree structure.
There is no way to do "string path matching" with barebones map AFAIK, but you could add your own rack middleware that looks at the request path and dispatches appropriately.
If you want to double-check the implementation, there are two relevant places. The first builds a regex from the mapped location:
Regexp.new("^#{Regexp.quote(location).gsub('/', '/+')}(.*)", nil, 'n')
This basically creates a regex out of a path that requires a string to start with that path (repeated slashes are ignored) in order to match. For example:
to_regex('/foo/bar') # => /^\/+foo\/+bar(.*)/n
In case you are wondering, the n flag sets the regex encoding to ASCII-8BIT (that is, no multibyte encoding).
If the string matches, a few more checks are performed, namely that the remainder of the matched path is either non-existent or starts with /. The latter ensures that you won't match things like /greatstuff with /great, as stuff doesn't start with /.
next unless !rest || rest.empty? || rest[0] == ?/
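The two pieces above can be combined into a small standalone sketch. This is not Rack's code, only an illustration of the logic described here: to_regex reuses the name from the example above, mount_matches? is a made-up helper, and the encoding flag is left out because it does not affect the matching of ASCII paths.

```ruby
# Build the prefix regex the same way as the quoted line, minus the
# encoding flag (an assumption made to keep the sketch simple).
def to_regex(location)
  Regexp.new("^#{Regexp.quote(location).gsub('/', '/+')}(.*)")
end

# Hypothetical helper: true when `path` falls under the mapped `location`.
def mount_matches?(location, path)
  match = to_regex(location).match(path)
  return false unless match

  rest = match[1]
  # The remainder must be absent, empty, or start with a slash.
  rest.nil? || rest.empty? || rest[0] == '/'
end

p mount_matches?('/great', '/great')        # => true
p mount_matches?('/great', '/great/stuff')  # => true
p mount_matches?('/great', '/greatstuff')   # => false
```

The last call is the case from the question: the regex itself matches /greatstuff, but the remainder "stuff" does not start with a slash, so the mapping is skipped.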

Don't match dot in beginning of string

I have a path in the form of a string, like Folder1/File.png.
If the file or the folder is hidden, I don't want the string to be matched by my regex.
regex = %r{([a-zA-Z0-9_ -]*)\/[^.]+$}
input_path = "Folder_1/.file" # This shouldn't be matched.
input_path = "Folder/file.png" # This should be matched.
My regex correctly rejects the first input, but it doesn't match the second one either.

You are currently looking for \/[^.]+$, that is a / followed by any character except . until the end. Since the filename+extension format has a . character, it fails to match the second case.
Instead of using [^.]+$, check only that the character following / is not ., and match everything after that:
([a-zA-Z0-9_ -]*)\/[^.].*$
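A quick check of the corrected pattern against the two inputs from the question. The constant and method names are made up for this sketch; inside %r{...} the slash needs no escaping.

```ruby
# Corrected pattern: a slash, one character that is not a dot,
# then anything up to the end of the line.
VISIBLE_FILE = %r{([a-zA-Z0-9_ -]*)/[^.].*$}

# Hypothetical helper wrapping the match.
def visible_file?(path)
  path.match?(VISIBLE_FILE)
end

p visible_file?("Folder_1/.file")   # => false
p visible_file?("Folder/file.png")  # => true
```

Note that the pattern only needs one slash followed by a non-dot character somewhere in the string, so a path with several components should be tested separately before relying on it.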

While there are some suggestions here that work, my suggestion would be
\/[^.][^\/\n]+$
It finds a slash, followed by anything but a dot, which in turn is followed by one, or more, of anything but a slash or a newline.
To handle the two lines given as an example,
Folder_1/.file
Folder/file.png
it takes 8 steps.
The suggested ones all work, but ([a-zA-Z0-9_ -]*)\/[^.] takes 75 steps, ([a-zA-Z0-9_ -]*)\/[^.]+\.[^.]+\z 78 steps and ([a-zA-Z0-9_ -]*)\/[^.].*$ takes 77 steps.
This may be totally irrelevant and I may have missed some angle, but I wanted to mention it ;)
See it here at regex101.

regex = %r{([a-zA-Z0-9_ -]*)\/[^.]}

Remove contents within a specific tag

Using Rails 3.2. I want to remove the <b> tags together with all the text inside them, but I have only found ways to strip the tags themselves:
string = "
<p>
<b>Section 1</b>
Everything is good.<br>
<b>Section 2</b>
All is well.
</p>"
string.strip_tags
# => "Section 1 Everything is good. Section 2 All is well."
I want to achieve this:
"Everything is good. All is well."
Should I add regex matching too?

The "right" way would be to use an HTML parser like Nokogiri.
However, for this simple task you may use a regex. It's quite simple:
Search for (?m)<b\s*>.*?<\/b\s*> and replace it with an empty string. After that, use strip_tags.
Regex explanation:
(?m) # set the m modifier to match newlines with dots .
<b # match <b
\s* # match a whitespace zero or more times
> # match >
.*? # match anything ungreedy until </b found
<\/b # match </b
\s* # match a whitespace zero or more times
> # match >
Online demo
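A minimal sketch of the two steps in plain Ruby, assuming the input from the question. The method names are made up, and since strip_tags is a Rails helper, a simple tag-stripping gsub stands in for it here so the snippet runs without Rails.

```ruby
# Step 1: drop every <b>...</b> element together with its content.
# The /m flag lets "." match newlines; ".*?" stops at the first closing tag.
def remove_bold_sections(html)
  html.gsub(/<b\s*>.*?<\/b\s*>/m, '')
end

# Step 2 (stand-in for Rails' strip_tags, an assumption of this sketch):
# replace the remaining tags with spaces and collapse whitespace.
def plain_text(html)
  remove_bold_sections(html).gsub(/<[^>]+>/, ' ').split.join(' ')
end

html = "<p>\n<b>Section 1</b>\nEverything is good.<br>\n<b>Section 2</b>\nAll is well.\n</p>"

p plain_text(html)  # => "Everything is good. All is well."
```

In a Rails app the second step would be the strip_tags call from the answer. The regex approach is only reasonable for simple, well-formed input like this.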

It would be much better to use an HTML/XML parser for this task. Ruby's standard library has no HTML parser, but Nokogiri is good and wraps libxml2 and libxslt:
doc = Nokogiri::XML string
doc.xpath("//b").remove
result = doc.text # or .inner_html to include `<p>`

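You can do string.gsub(/<b>.*?<\/b>/m, ''). The non-greedy .*? keeps two <b> elements on the same line from being swallowed by a single match, and the m flag lets the match span line breaks.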
http://rubular.com/r/hhmpY6Q6fX

If you want to sanitize the markup, you can try this:
ActionController::Base.helpers.sanitize("test<br>test<br>test<br> test")
If you want to remove all the tags, you need to use this:
ActionView::Base.full_sanitizer.sanitize("test<br>test<br>test<br> test")
These two differ slightly. The first one is good for neutralizing script tags to prevent XSS attacks, but it doesn't remove ordinary tags. The second one removes every HTML tag in the text.

Easiest way to remove Latex tag (but not its content)?

I am using TeXnicCenter to edit a LaTeX document.
I now want to remove a certain tag (say, \emph{blabla}) which occurs multiple times in my document, but not the tag's content (so in this example, I want to remove all emphasis).
What is the easiest way to do so?
May also be using another program easily available on Windows 7.
Edit: In response to regex suggestions, it is important that it can deal with nested tags.
Edit 2: I really want to remove the tag from the text file, not just disable it.

Using a regular expression, do something like s/\\emph\{([^\}]*)\}/\1/g. If you are not familiar with regular expressions, this says:
s -- replace
/ -- begin match section
\\emph\{ -- match \emph{
( -- begin capture
[^\}]* -- match any characters except (meaning up until) a close brace because:
  [] a group of characters
  ^ means not or "everything except"
  \} -- the close brace
  and * means 0 or more times
) -- end capture, because this is the first (in this case only) capture, it is number 1
\} -- match end brace
/ -- begin replace section
\1 -- replace with captured section number 1
/ -- end regular expression, begin extra flags
g -- global flag, meaning do this every time the match is found not just the first time
This is with Perl syntax, as that is what I am familiar with. The following perl "one-liners" will accomplish two tasks
perl -pe 's/\\emph\{([^\}]*)\}/\1/g' filename will "test" printing the file to the command line
perl -pi -e 's/\\emph\{([^\}]*)\}/\1/g' filename will change the file in place.
Similar commands may be available in your editor, but if not this will (should) work.
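The capture [^\}]* stops at the first closing brace, so the substitution above does not cope with the nested tags mentioned in the question's edit. As a hedged alternative, here is a small brace-tracking sketch in Ruby (not part of the original answer; the method name strip_command is made up). It assumes braces in the file are balanced and that the command is written without a space before its opening brace.

```ruby
# Remove every "\<command>{ ... }" wrapper but keep its content,
# including content that itself contains braces or further commands.
def strip_command(source, command = 'emph')
  opener = "\\#{command}{"
  output = +''
  stack = []   # one entry per open brace: true if opened by the command
  i = 0
  while i < source.length
    if source[i, opener.length] == opener
      stack.push(true)            # drop "\emph{" and remember to drop its "}"
      i += opener.length
    elsif source[i] == '{'
      stack.push(false)           # ordinary brace: keep it
      output << '{'
      i += 1
    elsif source[i] == '}'
      output << '}' unless stack.pop
      i += 1
    else
      output << source[i]
      i += 1
    end
  end
  output
end

nested = 'a \emph{b \textbf{c \emph{d}} e} f'
p strip_command(nested)  # => "a b \\textbf{c d} e f"
```

To apply it to a file, one could read the file with File.read, pass the contents through strip_command, and write the result back, ideally after making a backup.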

Crowley should have added this as an answer, but I will do that for him: if you replace every \emph{ with {, you should be able to do this without disturbing the other content. The text will still be in braces, but unless you have done some odd stuff it shouldn't matter.
The regex would be a simple s/\\emph\{/\{/g, but the search and replace in your editor will do that one too.
Edit: Sorry, used the wrong brace in the regex, fixed now.

\renewcommand{\emph}[1]{#1}

Any reasonably advanced editor should let you do a search/replace using regular expressions, replacing \emph{bla} with bla and so on.

How to make a Ruby string safe for a filesystem?

I have user entries as filenames. Of course this is not a good idea, so I want to drop everything except [a-z], [A-Z], [0-9], _ and -.
For instance:
my§document$is°° very&interesting___thisIs%nice445.doc.pdf
should become
my_document_is_____very_interesting___thisIs_nice445_doc.pdf
and then ideally
my_document_is_very_interesting_thisIs_nice445_doc.pdf
Is there a nice and elegant way for doing this?

I'd like to suggest a solution that differs from the old one. Note that the old one uses the deprecated returning method. It is also specific to Rails, and you didn't explicitly mention Rails in your question (only as a tag). In addition, the existing solution fails to encode .doc.pdf into _doc.pdf, as you requested, and it doesn't collapse runs of underscores into one.
Here's my solution:
def sanitize_filename(filename)
  # Split the name when finding a period which is preceded by some
  # character, and is followed by some character other than a period,
  # if there is no following period that is followed by something
  # other than a period (yeah, confusing, I know)
  fn = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)
  # We now have one or two parts (depending on whether we could find
  # a suitable period). For each of these parts, replace any unwanted
  # sequence of characters with an underscore
  fn.map! { |s| s.gsub(/[^a-z0-9\-]+/i, '_') }
  # Finally, join the parts with a period and return the result
  fn.join('.')
end
You haven't specified all the details about the conversion. Thus, I'm making the following assumptions:
There should be at most one filename extension, which means that there should be at most one period in the filename
Trailing periods do not mark the start of an extension
Leading periods do not mark the start of an extension
Any sequence of characters beyond A–Z, a–z, 0–9 and - should be collapsed into a single _ (i.e. underscore is itself regarded as a disallowed character, and the string '$%__°#' would become '_' – rather than '___' from the parts '$%', '__' and '°#')
The complicated part of this is where I split the filename into the main part and extension. With the help of a regular expression, I'm searching for the last period, which is followed by something else than a period, so that there are no following periods matching the same criteria in the string. It must, however, be preceded by some character to make sure it's not the first character in the string.
My results from testing the function:
1.9.3p125 :006 > sanitize_filename 'my§document$is°° very&interesting___thisIs%nice445.doc.pdf'
=> "my_document_is_very_interesting_thisIs_nice445_doc.pdf"
which I think is what you requested. I hope this is nice and elegant enough.
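The assumptions listed above can be exercised with a few edge cases. This sketch repeats the function so it runs on its own, and uses ASCII-only sample names (made up for illustration) to show how leading periods, trailing periods and multiple periods are treated.

```ruby
# Same logic as the answer's function, repeated here so the block runs alone.
def sanitize_filename(filename)
  # Split at the last period that has a character before it and a
  # non-period character after it.
  parts = filename.split(/(?<=.)\.(?=[^.])(?!.*\.[^.])/m)
  # Collapse each run of disallowed characters into one underscore.
  parts.map! { |part| part.gsub(/[^a-z0-9\-]+/i, '_') }
  parts.join('.')
end

p sanitize_filename('my doc$$is very&nice__445.doc.pdf')  # => "my_doc_is_very_nice_445_doc.pdf"
p sanitize_filename('a.b.c')      # => "a_b.c"    only the last period survives
p sanitize_filename('.bashrc')    # => "_bashrc"  a leading period is not an extension
p sanitize_filename('report.')    # => "report_"  a trailing period is not an extension
```

Whether "_bashrc" is the desired result for hidden files depends on the application. The function treats the leading period like any other disallowed character.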

From http://web.archive.org/web/20110529023841/http://devblog.muziboo.com/2008/06/17/attachment-fu-sanitize-filename-regex-and-unicode-gotcha/:
def sanitize_filename(filename)
  returning filename.strip do |name|
    # NOTE: File.basename doesn't work right with Windows paths on Unix
    # get only the filename, not the whole path
    name.gsub!(/^.*(\\|\/)/, '')
    # Replace every character outside 0-9, A-Z, a-z, period and hyphen
    name.gsub!(/[^0-9A-Za-z.\-]/, '_')
  end
end

In Rails you might also be able to use ActiveStorage::Filename#sanitized:
ActiveStorage::Filename.new("foo:bar.jpg").sanitized # => "foo-bar.jpg"
ActiveStorage::Filename.new("foo/bar.jpg").sanitized # => "foo-bar.jpg"

If you use Rails you can also use String#parameterize. This is not particularly intended for that, but you will obtain a satisfying result.
"my§document$is°° very&interesting___thisIs%nice445.doc.pdf".parameterize

For Rails I found myself wanting to keep any file extensions but using parameterize for the remainder of the characters:
filename = "my§doc$is°° very&itng___thsIs%nie445.doc.pdf"
cleaned = filename.split(".").map(&:parameterize).join(".")
For implementation details and ideas, see the source: https://github.com/rails/rails/blob/master/activesupport/lib/active_support/inflector/transliterate.rb
def parameterize(string, separator: "-", preserve_case: false)
  # Turn unwanted chars into the separator.
  parameterized_string.gsub!(/[^a-z0-9\-_]+/i, separator)
  #... some more stuff
end

If your goal is just to generate a filename that is "safe" to use on all operating systems (and not to remove any and all non-ASCII characters), then I would recommend the zaru gem. It doesn't do everything the original question specifies, but the filename produced should be safe to use (and still keep any filename-safe unicode characters untouched):
Zaru.sanitize! " what\ēver//wëird:user:înput:"
# => "whatēverwëirduserînput"
Zaru.sanitize! "my§docu*ment$is°° very&interes:ting___thisIs%nice445.doc.pdf"
# => "my§document$is°° very&interesting___thisIs%nice445.doc.pdf"

There is a library that may be helpful, especially if you're interested in replacing weird Unicode characters with ASCII: unidecoder.
irb(main):001:0> require 'unidecoder'
=> true
irb(main):004:0> "Grzegżółka".to_ascii
=> "Grzegzolka"
