

2.8 Tokenizing a String

The split method breaks a string into tokens and returns them as an array of strings. It accepts two optional parameters: a delimiter and a field limit (an integer).

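The delimiter defaults to whitespace. More precisely, when no delimiter is given, split uses the global variable $; (or its English equivalent $FIELD_SEPARATOR, available after require 'English'), which is normally nil and therefore means "split on runs of whitespace." If the delimiter is a string, the literal value of that string is used as the token separator; if it is a regular expression, every match of the pattern acts as a separator: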

s1 = "It was a dark and stormy night."
words = s1.split          # ["It", "was", "a", "dark", "and",
                          #  "stormy", "night."]
s2 = "apples, pears, and peaches"
list = s2.split(", ")     # ["apples", "pears", "and peaches"]

s3 = "lions and tigers and bears"
zoo = s3.split(/ and /)   # ["lions", "tigers", "bears"]
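The whitespace default is worth seeing next to a regex that looks equivalent but is not. The following is a minimal sketch; the sample string and variable names are made up for illustration. It shows that the default (and the single-space string " ") ignores leading whitespace and collapses runs, whereas the regex / / treats every individual space as a separator.

```ruby
# Illustrative input with leading, repeated, and trailing spaces (made up).
line = "  red   green blue "

awk_style   = line.split        # default: split on runs of whitespace
same_thing  = line.split(" ")   # a single-space string behaves the same way
regex_split = line.split(/ /)   # each single space is its own separator

p awk_style     # ["red", "green", "blue"]
p same_thing    # ["red", "green", "blue"]
p regex_split   # ["", "", "red", "", "", "green", "blue"]
```

The empty strings in the last result come from adjacent separators; the trailing one is dropped because no limit was given.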

The limit parameter places an upper limit on the number of fields returned, according to these rules:

  • If it is omitted, trailing empty entries are suppressed.
  • If it is a positive number, the number of entries will be limited to that number (stuffing the rest of the string into the last field as needed). Trailing empty entries are retained.
  • If it is a negative number, there is no limit to the number of fields, and trailing empty entries are retained.

These three rules are illustrated here:

str = "alpha,beta,gamma,,"
list1 = str.split(",")     # ["alpha","beta","gamma"]
list2 = str.split(",",2)   # ["alpha", "beta,gamma,,"]
list3 = str.split(",",4)   # ["alpha", "beta", "gamma", ","]
list4 = str.split(",",8)   # ["alpha", "beta", "gamma", "", ""]
list5 = str.split(",",-1)  # ["alpha", "beta", "gamma", "", ""]
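As a rough sketch of where the limit rules matter in practice, consider a key/value line whose value contains the delimiter, and a record whose trailing fields are empty. Both input strings here are hypothetical examples, not taken from the text above.

```ruby
# Hypothetical config line; the value itself contains "=".
line = "path=/usr/bin=/opt/bin"

# A limit of 2 splits only at the first "=", leaving the rest intact.
key, value = line.split("=", 2)
p key     # "path"
p value   # "/usr/bin=/opt/bin"

# Hypothetical record with two empty trailing columns.
record = "Ann,42,,"

p record.split(",").size       # 2 (trailing empty fields dropped)
p record.split(",", -1).size   # 4 (trailing empty fields kept)
```

A negative limit is the usual choice when the number of columns carries meaning, since it preserves the position of every field.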

Similarly, the scan method can be used to match regular expressions or strings against a target string:

str = "I am a leaf on the wind..."

# A string is interpreted literally, not as a regex
arr = str.scan("a")   # ["a","a","a"]

# A regex will return all matches
arr = str.scan(/\w+/)
# ["I", "am", "a", "leaf", "on", "the", "wind"]

# A block will be passed each match, one at a time
str.scan(/\w+/) {|x| puts x }
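One detail not shown above: when the regex contains capture groups, scan returns an array of arrays, one inner array of group values per match. A small sketch, using a made-up settings string:

```ruby
# Made-up input; each match contributes one [name, value] pair.
text  = "width=80, height=24"
pairs = text.scan(/(\w+)=(\d+)/)
p pairs      # [["width", "80"], ["height", "24"]]

# The pairs convert naturally to a hash.
settings = pairs.to_h
p settings["height"]   # "24"
```

Note that the captured values are still strings; convert them with to_i if numbers are needed.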

The StringScanner class, from the standard library, is different in that it maintains state for the scan rather than doing it all at once:

require 'strscan'
str = "Watch how I soar!"
ss = StringScanner.new(str)
loop do
  word = ss.scan(/\w+/)    # Grab a word at a time
  break if word.nil?
  puts word
  sep = ss.scan(/\W+/)     # Grab next non-word piece
  break if sep.nil?
end
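Because StringScanner keeps its position between calls, it lends itself to small hand-written tokenizers. The following is only a sketch under stated assumptions: the tokenize method, the token format, and the tiny arithmetic grammar are all hypothetical, chosen just to show scan, skip, eos?, and pos working together.

```ruby
require 'strscan'

# Hypothetical helper: break a simple arithmetic expression into
# [type, value] pairs. Not a complete lexer, just an illustration.
def tokenize(expr)
  ss = StringScanner.new(expr)
  tokens = []
  until ss.eos?
    ss.skip(/\s+/)                  # advance past whitespace, if any
    break if ss.eos?
    if (num = ss.scan(/\d+/))       # scan returns nil when there is no match here
      tokens << [:number, num.to_i]
    elsif (op = ss.scan(/[-+*\/()]/))
      tokens << [:op, op]
    else
      raise ArgumentError, "unexpected character at position #{ss.pos}"
    end
  end
  tokens
end

p tokenize("12 + 3*(4 - 1)")
```

Unlike String#scan, each call here matches only at the current position, so an unrecognized character can be reported with its exact offset instead of being silently skipped.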