scan and match return different results when capturing groups with regex in Ruby

Yingqi Chen
3 min read · Apr 30, 2020

I ran into some problems while writing a Ruby method that prints the extension of a file. While exploring, I found something interesting about the scan and match methods. That is what this article is about.

I will also show you Ruby’s built-in method for this. If that is what you are interested in, you can scroll down to the SOLUTION part.

Problem

Ok, long story short, ideally, the method should work like this:

print_extension("test.rb")
=> rb

To split the file name argument into the name and the extension, I decided to use regex and capture the extension as a group. To apply a regex to a string in Ruby, I have two options: the scan method and the match method. So I wrote the following method to compare what I get from each:

def test(str)
  word_match = str.match(/\.(\w*)/)
  word_scan = str.scan(/\.(\w*)/)
  puts word_match, word_scan
end

test("test.rb")

At first, I thought I would get the same result. However, here is what I get:

.rb
rb

So when I use the match method, I still get the . back, while with scan there is no such problem. What is happening here?

Reason

The reason lies in their different return values. To better describe what is going on, I want to show you two different situations:

CAPTURE ITEMS

In the simplest case, let’s say you just want to match a pattern without capturing a group. Here scan returns an array while match returns a MatchData object. If I change my test method this way, you will see the difference:

def test(str)
  word_match = str.match(/\.\w*/).size
  word_scan = str.scan(/\.\w*/).size
  puts word_match, word_scan
end

test("test.rb test.rb")
=> 1
=> 2

match finds ONLY the first substring that matches the regex, while scan finds ALL of them. So with the target string “test.rb test.rb”, the MatchData from match holds one element (the first match, .rb), while the array from scan holds two.
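To make the difference visible without relying on size, here is a small sketch that returns both results side by side. The helper name compare_plain and the second file name test.py are just assumptions for illustration.

```ruby
# Hypothetical helper: compare scan and match for a pattern with no capture groups.
def compare_plain(str)
  pattern = /\.\w*/
  {
    scan: str.scan(pattern),   # Array with every matched substring
    match: str.match(pattern)  # MatchData for the first match only (nil if none)
  }
end

result = compare_plain("test.rb test.py")
p result[:scan]        # [".rb", ".py"]
p result[:match][0]    # ".rb"
p result[:match].size  # 1 (only the whole match, since there are no groups)
```

Note that the 1 printed for match is the number of elements inside the MatchData, not a count of matches in the string.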

CAPTURE GROUPS

In the situation where I capture groups (hence the parentheses around \w*), I will show you exactly what they return using the inspect method:

def test(str)
  word_match = str.match(/\.(\w*)/).inspect
  word_scan = str.scan(/\.(\w*)/).inspect
  puts word_match, word_scan
end

test("test.rb")
=> #<MatchData ".rb" 1:"rb">
=> [["rb"]]

The return values are a bit different. The scan method still returns an array, but each element is now an array too, holding the captured groups of one match. The match method returns a MatchData with two elements: the whole matched string and the captured group (even if the string contains more than one match, it still returns these TWO elements ONLY, because it stops at the first match).
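A sketch with two capture groups and two matches may make the shapes clearer. The pattern /(\w+)\.(\w+)/, the helper name compare_groups, and the input string are assumptions chosen for this example, not part of the original method.

```ruby
# Hypothetical helper: compare scan and match for a pattern with two capture groups.
def compare_groups(str)
  pattern = /(\w+)\.(\w+)/
  m = str.match(pattern)
  {
    scan: str.scan(pattern),       # one inner array of captures per match
    whole: m && m[0],              # the whole first match
    captures: m ? m.captures : []  # only the groups of the first match
  }
end

result = compare_groups("test.rb main.py")
p result[:scan]      # [["test", "rb"], ["main", "py"]]
p result[:whole]     # "test.rb"
p result[:captures]  # ["test", "rb"]
```

With scan, the whole matched string is not included once groups are present; with match, it is always available at index 0.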

Now it is interesting to look at what is returned from the first example:

def test(str)
  word_match = str.match(/\.(\w*)/)
  word_scan = str.scan(/\.(\w*)/)
  puts word_match, word_scan
end

test("test.rb")
=> .rb
=> rb

The scan method still returns the nested array [["rb"]], but puts flattens arrays (including nested ones) and prints each element on its own line. That is why the square brackets disappear and you only see the innermost captured group.

The match method still returns the whole MatchData object, but puts calls to_s on it, and MatchData#to_s returns only the whole matched string, which is .rb.

Above is what I found through experiment. The Ruby documentation for puts and MatchData#to_s describes this behavior. If you know more about why they work this way, please DM me 😀
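One way to check what is going on is to compare the real return values with what puts writes. As far as I can tell, puts flattens arrays and calls to_s on other objects, and this sketch captures its output in a StringIO to show that. The helper name puts_output is an assumption for illustration.

```ruby
require "stringio"

# Hypothetical helper: capture what puts would write for a value.
def puts_output(value)
  io = StringIO.new
  io.puts(value)
  io.string
end

str = "test.rb"
m = str.match(/\.(\w*)/)
s = str.scan(/\.(\w*)/)

p m.to_s          # ".rb"    (MatchData#to_s is the whole matched string)
p s.flatten       # ["rb"]   (the nested array without its inner brackets)
p puts_output(m)  # ".rb\n"
p puts_output(s)  # "rb\n"
```

Using p or inspect instead of puts while debugging shows the actual objects, which avoids this kind of surprise.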

SOLUTION

So now we know what is happening. If you want to reach the captured part with match, you can simply do: word_match = str.match(/\.(\w*)/)[1] to access the captured group within the MatchData object. There are also two built-in methods that can be used:

#captures

According to the Ruby documentation for MatchData, you can use the captures method this way:

word_match = str.match(/\.(\w*)/).captures

This returns an array of the captured groups, here ["rb"], so take its first element if you want the string itself.
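Putting this together, here is a minimal sketch of the print_extension method from the Problem section. The helper extension_of, the \z anchor (so that only the last dot counts), the switch from \w* to \w+, and the &. safe navigation for strings without a match are my assumptions, not part of the original attempt.

```ruby
# Hypothetical helper: return the extension without the dot, or nil if there is none.
def extension_of(filename)
  # \z anchors the pattern at the end of the string, so "archive.tar.gz" gives "gz".
  # &. skips the following call when match returns nil.
  filename.match(/\.(\w+)\z/)&.captures&.first
end

def print_extension(filename)
  puts extension_of(filename)
end

print_extension("test.rb")         # prints rb
print_extension("archive.tar.gz")  # prints gz
p extension_of("README")           # nil
```

Without the \z anchor, the original pattern would stop at the first dot and return tar for archive.tar.gz.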

#extname(path)

Or, if you are OK with the . in front of the extension, the extname method offered by the File class does the job:

File.extname("test.rb") => ".rb"
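If you would rather not keep the dot, a small sketch on top of File.extname could look like this. The helper name extension_without_dot is an assumption, and String#delete_prefix needs Ruby 2.5 or later.

```ruby
# Hypothetical helper: File.extname without the leading dot.
def extension_without_dot(path)
  File.extname(path).delete_prefix(".")
end

p extension_without_dot("test.rb")         # "rb"
p extension_without_dot("archive.tar.gz")  # "gz"
p extension_without_dot("lib/test.rb")     # "rb"
p extension_without_dot("README")          # ""
p extension_without_dot(".bashrc")         # "" (a leading dot is not an extension)
```

File.extname only looks at the last dot of the base name, so it also handles paths with directories and names with several dots.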

Recap

This article compared the scan and match methods. scan returns an array of ALL matched items, while match returns ONLY the first match. When the regex contains a capture group, scan returns an array of arrays holding the captured strings of every match, and match returns a MatchData object that includes the whole matched string as well as the captured strings. You can use [1] or #captures to access the captured group from match, or File.extname if you want to keep the dot in the extension.

Thanks for reading this!

Written by Yingqi Chen

Software engineer and a blockchain noob. Excited about the new world!! LinkedIn: https://www.linkedin.com/in/yingqi-chen/
