scan and match return different results when capturing groups with a Ruby regex
I ran into a problem while writing a Ruby method that prints the extension of a file. While exploring, I found something interesting about the scan and match methods, and that is what this article is about.
I will also show you Ruby’s built-in method for getting the extension. If that is all you need, you can scroll down to the SOLUTION part.
Problem
OK, long story short, ideally the method should work like this:
print_extension("test.rb") => rb
To split the file name argument into the base name and the extension, I decided to use a regex and capture the extension as a group. Ruby gives me two options for this, the scan method and the match method, so I wrote the following method to compare what I get from each:
def test(str)
  word_match = str.match(/\.(\w*)/)
  word_scan = str.scan(/\.(\w*)/)
  puts word_match, word_scan
end
test("test.rb")
At first, I thought I would get the same result. However, here is what I get:
.rb
rb
So with the match method I still get the . back, while with scan there is no such problem. What is happening here?
Reason
The reason lies in their different return values. To better describe what is going on, I want to show you two different situations:
CAPTURE ITEMS
In the simplest case, let’s say you just want to match a pattern without capturing a group. Here scan returns an Array while match returns a MatchData object. If I change my test method as follows, you can see the difference:
def test(str)
  word_match = str.match(/\.\w*/).size
  word_scan = str.scan(/\.\w*/).size
  puts word_match, word_scan
end
test("test.rb test.rb")
=> 1
=> 2
match finds ONLY the first substring that matches the regex, while scan finds ALL of them. So with the target string “test.rb test.rb”, the MatchData from match holds one element (the first match) and the array from scan holds two.
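To make the difference concrete, here is a small sketch that looks at the class and the contents of each return value rather than just the size. The helper name compare_plain is made up for this illustration.

```ruby
# Hypothetical helper: collects what match and scan give back for a
# pattern that has no capture groups.
def compare_plain(str)
  pattern = /\.\w*/
  {
    match_class: str.match(pattern).class, # MatchData
    match_items: str.match(pattern).to_a,  # only the first match
    scan_class:  str.scan(pattern).class,  # Array
    scan_items:  str.scan(pattern)         # every match
  }
end

result = compare_plain("test.rb test.rb")
p result[:match_items] # [".rb"]
p result[:scan_items]  # [".rb", ".rb"]
```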
CAPTURE GROUPS
In the situation where I capture groups (hence the parentheses around the \w*), the inspect method shows exactly what each call returns:
def test(str)
  word_match = str.match(/\.(\w*)/).inspect
  word_scan = str.scan(/\.(\w*)/).inspect
  puts word_match, word_scan
end
test("test.rb")
#<MatchData ".rb" 1:"rb">
[["rb"]]
The return values are a bit different. The scan method still returns an array, but each element is now itself an array holding the captured groups of one match. The match method, however, returns a MatchData with two elements: the whole matched string and the captured group (even if the string contains more than one match, it still holds ONLY these TWO elements, both taken from the first match).
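A short sketch of the same comparison on inputs with more than one match and more than one group may help. The helper name compare_groups and the sample strings are assumptions for this example only.

```ruby
# Hypothetical helper: returns [match result as an array, scan result]
# for a pattern that contains capture groups.
def compare_groups(str, pattern)
  [str.match(pattern).to_a, str.scan(pattern)]
end

# One group, two matches: scan yields one inner array per match,
# match only describes the first match.
match_items, scan_items = compare_groups("test.rb spec.py", /\.(\w*)/)
p match_items # [".rb", "rb"]
p scan_items  # [["rb"], ["py"]]

# Two groups, one match: each inner array holds one entry per group.
match_items, scan_items = compare_groups("test.rb", /(\w+)\.(\w*)/)
p match_items # ["test.rb", "test", "rb"]
p scan_items  # [["test", "rb"]]
```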
Now it is interesting to look at what is returned from the first example:
def test(str)
  word_match = str.match(/\.(\w*)/)
  word_scan = str.scan(/\.(\w*)/)
  puts word_match, word_scan
end
test("test.rb")
=> .rb
=> rb
Neither method is dropping anything here. The difference comes from how puts prints each return value.
When puts receives an array, it flattens it and prints each element on its own line, so the nested array [["rb"]] from scan shows up as just rb, without the square brackets.
When puts receives a MatchData, it calls to_s on it, and MatchData#to_s returns the whole matched string, which is .rb.
I first found this through experiment; the behavior is described in the Ruby documentation for puts and for MatchData#to_s. If you know of any details I missed, please DM me😀
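One way to check where the printed output comes from is to capture what puts writes for each return value. This is only a sketch: the helper printed_by_puts is a made-up name, and it uses StringIO from the standard library so the output can be inspected as a string.

```ruby
require "stringio"

# Hypothetical helper: returns exactly what puts would write for obj.
def printed_by_puts(obj)
  out = StringIO.new
  out.puts(obj)
  out.string
end

m = "test.rb".match(/\.(\w*)/)
s = "test.rb".scan(/\.(\w*)/)

p m.to_s             # ".rb"  (the whole match, which is what puts prints)
p printed_by_puts(m) # ".rb\n"
p s                  # [["rb"]]
p printed_by_puts(s) # "rb\n" (nested arrays are flattened by puts)
```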
SOLUTION
So now we know what is happening. If you want to reach the captured part with match, you can index the MatchData object with word_match = str.match(/\.(\w*)/)[1] to access the first captured group. There are also two built-in methods that can be used:
#captures
According to the Ruby documentation for MatchData, you can use the captures method in this way:
word_match = str.match(/\.(\w*)/).captures
This returns an array of the captured groups, here ["rb"], so take its first element to get the extension itself.
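Putting this together, here is one possible sketch of the method from the Problem section. The name extension_of is hypothetical, and the nil check is my own addition: match returns nil when nothing matches, so calling captures directly on the result would raise an error for a name without a dot.

```ruby
# Hypothetical helper: returns the extension without the dot, or nil
# when the file name contains no dot at all.
def extension_of(filename)
  match = filename.match(/\.(\w*)/)
  return nil if match.nil? # no dot found, nothing was captured

  match.captures.first
end

p extension_of("test.rb") # "rb"
p extension_of("README")  # nil

# String#[] with a regex and a group index is a shorter alternative.
p "test.rb"[/\.(\w*)/, 1] # "rb"
```

Note that this regex captures what follows the first dot, so a name with several dots gives the first segment rather than the last one.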
#extname(path)
Or, if you are OK with the . in front of the extension, the extname method offered by the File class does the job:
File.extname("test.rb") => ".rb"
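If the leading dot is not wanted, one option is to strip it afterwards. This is a minimal sketch, assuming Ruby 2.5 or later for String#delete_prefix; the helper name extension_without_dot is made up for the example.

```ruby
# Hypothetical helper: File.extname result with the leading dot removed.
def extension_without_dot(path)
  File.extname(path).delete_prefix(".")
end

p extension_without_dot("test.rb")        # "rb"
p extension_without_dot("archive.tar.gz") # "gz" (only the last extension)
p extension_without_dot("README")         # ""   (no extension)
```

Unlike the regex /\.(\w*)/, which captures what follows the first dot, File.extname looks at the last dot of the file name.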
Recap
This article looked at the scan and match methods. scan returns an array of ALL matched items, while match returns ONLY the first match, wrapped in a MatchData object. When the regex contains capture groups, scan returns an array of arrays, one inner array of captured strings per match, and match returns a MatchData object that includes the whole matched string and also the captured strings. You can use [1] or #captures to access the captured group from match, or File.extname if you want to keep the dot in the extension.
Thanks for reading this!