
13.3 Verifying email addresses using pattern matching


13.3.1 Pattern matching and emailing

In the early days of Web design and online business, one of the daily problems that network administrators faced was the growing number of invalid email addresses. Unlike other information, an email address has to be precise and complete so that mail can be delivered properly. Just imagine how many times you have typed spaces, commas, semi-colons, or more than one "@" symbol in an email address. Before the advent of so-called fourth-generation browser software (e.g., IE4+ and NS4+), picking up these errors and validating an email address in the browser was a difficult problem.

One effective and convenient way of validating an email address is by means of pattern matching. Some people may prefer to call it "Regular Expressions," a term popularized by Perl and adopted by JScript and ECMAScript. Pattern matching has long been a popular and powerful tool for administrators of UNIX systems and for Perl programmers. In this section, we will look at some practical pattern-matching techniques and their applications. These techniques can be used to analyze the user's input and detect a malformed email address.

One straightforward method to validate an email address is to see whether an address contains the "@" symbol. The following is an example:



Listing: ex13-02.txt

1: <script>
2: function isEmail(valSt)
3: {
4:  var patSt = /@/
5:  return( patSt.test(valSt) )
6: }
7: </script>

This is a simple script function. The input argument valSt is the string to be searched and matched against a predefined pattern. As a special feature of scripting, anything between a pair of forward slashes defines a pattern. In line 4 there is a pattern with the single character "@". The statement in line 5



patSt.test(valSt)

uses the test method to search the string valSt for a match of the pattern stored in the variable patSt. It returns true if an "@" symbol is found anywhere in valSt, and false otherwise.
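As a quick check of this function, the following sketch calls it on a few strings. The sample strings and the use of console.log (which assumes a JavaScript console such as a browser's developer tools or Node.js) are assumptions made for illustration and are not part of the original listing.

```javascript
// Same check as ex13-02: is there an "@" anywhere in the string?
function isEmail(valSt) {
  var patSt = /@/;
  return patSt.test(valSt);
}

// Hypothetical sample inputs.
var samples = ["JohnSmith@pwt-ex.com", "JohnSmith.pwt-ex.com", "@"];
var results = samples.map(function (s) { return isEmail(s); });

console.log(results); // [ true, false, true ]
```

Note that the lone string "@" also passes, which shows why a single-character pattern is only a first step.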

13.3.2 Eliminating malformed addresses with quantifiers

The real power of the pattern-matching technique is to validate strings with quantifiers. Quantifiers are special notation used to indicate how many times the preceding pattern should be matched. For example, the following



/(@){2}/

matches "@@". A quantifier applies only to the item immediately before it, which is a single character unless a group is used. Thus the following pattern matches the string "JohnSmithh":



"JohnSmith{2}"

If you want a quantifier to apply to a string of multiple characters, you must group them together by means of parentheses. For example,

"John(Smith){2}" matches the string "JohnSmithSmith"

Table 13.1 contains some frequently used quantifiers.

Table 13.1. Quantifiers

Quantifier   Description
{m,n}        Must occur at least m times, but not more than n times
{n,}         Must occur at least n times
{n}          Must occur exactly n times
*            Must occur 0 or more times (same as {0,})
+            Must occur 1 or more times (same as {1,})
?            Must occur 0 or 1 time (same as {0,1})
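The quantifiers and the grouping rule can be tried out with a short sketch such as the one below. The test strings are made up for illustration, and the "^" and "$" anchors are an assumption added here so that each pattern has to match the whole string rather than just a part of it.

```javascript
// Anchors ^ and $ (assumed here) force the whole string to match.
var lastLetterTwice = /^JohnSmith{2}$/; // {2} applies to the final "h" only
var groupTwice = /^John(Smith){2}$/;    // {2} applies to the group "Smith"

// Each entry: [pattern, input, expected result of pattern.test(input)]
var checks = [
  [lastLetterTwice, "JohnSmithh", true],
  [lastLetterTwice, "JohnSmithSmith", false],
  [groupTwice, "JohnSmithSmith", true],
  [/^@{2,3}$/, "@@@", true],   // {m,n}: between 2 and 3 times
  [/^@{2,3}$/, "@@@@", false],
  [/^@{2,}$/, "@@@@", true],   // {n,}: at least 2 times
  [/^@{2}$/, "@@@", false],    // {n}: exactly 2 times
  [/^a@*$/, "a", true],        // *: zero or more
  [/^a@+$/, "a", false],       // +: one or more
  [/^a@?$/, "a@", true],       // ?: zero or one
  [/^a@?$/, "a@@", false]
];

// Collect every entry whose actual result differs from the expected one.
var mismatches = checks.filter(function (c) {
  return c[0].test(c[1]) !== c[2];
});

console.log(mismatches.length); // 0
```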


To understand quantifiers and their applications to address validation, let's consider the pattern



/(@.*@)/

This pattern won't be matched unless a string contains more than one "@" symbol. The pattern starts with the "@" symbol. The period "." represents any character. Together with the quantifier "*", it matches any run of characters, including none at all. The result is a pattern that matches any string with two "@" symbols somewhere, whether they are adjacent or separated by one or more characters.

This provides a powerful tool for searching and eliminating email addresses with more than one "@" symbol in the address string. To see how quantifiers work, let's look at the following function:



Listing: ex13-03.txt

1: function isEmail(valSt)
2: {
3:  var patSt = /.+@/
4:  var patSt2 = /(@.*@)/
5:  return( patSt.test(valSt) && ! patSt2.test(valSt) )
6: }

This function has two match patterns. The first pattern patSt defines a positive format for an email address. This means email addresses should have the "@" symbol after some characters. The second pattern patSt2 defines a negative format. This format represents some common errors that the user may make when writing an email address. The return statement in line 5 will return true only if the input string valSt matches the first pattern (patSt) and not the second one (patSt2). This function can be used to correctly identify the following types of common malformed addresses:

@pwt-ex.com              (@ cannot be the first character)
JohnSmith.pwt-ex.com     (missing the @ symbol)
John@Smith@pwt-ex.com    (more than one @ symbol)
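The behavior described for ex13-03 can be checked with a sketch like the following. It repeats the function without line numbers and runs it on the three malformed addresses plus one well-formed address; the array name testAddresses and the console.log output are assumptions for illustration.

```javascript
function isEmail(valSt) {
  var patSt = /.+@/;      // positive format: some characters, then "@"
  var patSt2 = /(@.*@)/;  // negative format: more than one "@"
  return patSt.test(valSt) && !patSt2.test(valSt);
}

// Hypothetical test data: one good address and the three malformed types.
var testAddresses = [
  "JohnSmith@pwt-ex.com",
  "@pwt-ex.com",
  "JohnSmith.pwt-ex.com",
  "John@Smith@pwt-ex.com"
];

var verdicts = testAddresses.map(function (addr) { return isEmail(addr); });
console.log(verdicts); // [ true, false, false, false ]
```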


Of course malformed addresses come in different forms. You need more match patterns to eliminate the malformed addresses. Consider the following:

/(\.\.)/    matches    JohnSmith@pwt-ex..com
/(@\.)/     matches    JohnSmith@.pwt-ex.com
/(^\.)/     matches    .JohnSmith@pwt-ex.com


As discussed, a period or full stop symbol in pattern matching matches any character. If a special character such as a period "." needs to be matched literally, you simply put a backslash before it, as shown in the patterns above. A caret (or hat) symbol "^" instructs the match engine to match only at the beginning of the string. The following example page illustrates how to put these patterns into action.



Example: ex13-07.htm - Validate Email Addresses (I)

 1: <?xml version="1.0" encoding="iso-8859-1"?>
 2: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 3:     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 4: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 5: <head><title>Validate Email Addresses (I) -- ex13-07.htm</title></head>
 6: <body style="font-family:arial;font-size:16pt;background:#000088;
 7:   color:#ffff00;text-align:center">
 8: <script>
 9: function isEmail(valSt)
10:  {
11:   var patSt = /.+@/
12:   var patSt2 = /(.*@.*@)|(\.\.)|(@\.)|(^\.)/
13:   return( patSt.test(valSt) && ! patSt2.test(valSt) )
14:  }
15:
16:  emailAdd = new Array()
17:  emailAdd[1]="JohnSmith@pwt-ex.com"
18:  emailAdd[2]="John@Smith@pwt-ex.com"
19:  emailAdd[3]="JohnSmith@.pwt-ex.com"
20:  emailAdd[4]="JohnSmith@isp..com "
21:  emailAdd[5]=".JohnSmith@pwt-ex.com"
22:  emailAdd[6]="@pwt-ex.com"
23:
24: for (ii=1;ii<emailAdd.length;ii++)
25: {
26: document.write(emailAdd[ii]+" is likely a "+isEmail(emailAdd[ii])+
27:   " Email address<br />")
28: }
29:
30: </script>
31:
32: </body>
33: </html>

With the detailed explanation above, this example is easy to understand. All of the malformed address formats are combined into one pattern. The vertical bar "|" in line 12 is the alternation ("OR") operator, which allows a single pattern to match a variety of malformed addresses. A screen shot of this page is given in Fig. 13.15.

Figure 13.15. ex13-07.htm

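For readers who want to try the same check outside a Web page, here is a minimal console sketch of ex13-07.htm. It assumes a JavaScript console (for example Node.js), replaces document.write with console.log, and stores the addresses in an array literal; the variable name results07 is hypothetical.

```javascript
function isEmail(valSt) {
  var patSt = /.+@/;                          // some characters before "@"
  var patSt2 = /(.*@.*@)|(\.\.)|(@\.)|(^\.)/; // any one of the malformed forms
  return patSt.test(valSt) && !patSt2.test(valSt);
}

// The same six addresses as in the example page.
var emailAdd = [
  "JohnSmith@pwt-ex.com",
  "John@Smith@pwt-ex.com",
  "JohnSmith@.pwt-ex.com",
  "JohnSmith@isp..com ",
  ".JohnSmith@pwt-ex.com",
  "@pwt-ex.com"
];

var results07 = emailAdd.map(function (addr) { return isEmail(addr); });

emailAdd.forEach(function (addr, i) {
  console.log(addr + " is likely a " + results07[i] + " Email address");
});
// Only the first address is reported as true.
```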


Detecting some malformed addresses is unfortunately only the first step in validating email addresses. It is difficult, if not impossible, to list all the different ways an email address can go wrong. For example, a user can accidentally put a space or a non-printable character anywhere in the email address. You need a more constructive method to accomplish this difficult task. To this end, a discussion of some special symbols and rules for pattern matching is helpful.

13.3.3 Some rules for pattern matching

As discussed earlier, pattern matching is a subject popularized by the Perl language. Only a simplified version is discussed here, with the emphasis on the validation of email addresses. Let's begin with some general rules (Table 13.2); you have already seen some of them in previous sections.

Table 13.2. General rules for pattern matching

^      Matches at the beginning of the string. For example, /^@/ matches any string beginning with @
$      Matches at the end of the string. For example, /@$/ matches any string ending with @
( )    A pattern in parentheses matches the pattern inside
.      Matches any character. For example, .* matches any number of don't-care characters
[ ]    A pattern in square brackets (called a list of characters) matches any one of the characters in the list


For a character list, a hyphen may be used as a range delimiter. For example, the pattern /[a-z0-9]/ matches any character from a to z or 0 to 9. A caret (^) at the beginning of the list causes it to match only characters that are not in the list. For example, the pattern /[^0-9]/ matches any non-digit character.
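A short sketch can make character lists concrete. The sample strings below are invented for illustration; the patterns are the range list and the negated list just described.

```javascript
var lowerOrDigit = /[a-z0-9]/; // any one character from a to z or 0 to 9
var nonDigit = /[^0-9]/;       // any one character that is not a digit

console.log(lowerOrDigit.test("ABC"));  // false: no lowercase letter or digit
console.log(lowerOrDigit.test("ABC7")); // true: "7" is in the list
console.log(nonDigit.test("2024"));     // false: every character is a digit
console.log(nonDigit.test("20a4"));     // true: "a" is not a digit
```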

Characters that have special meaning are metacharacters and they don't match themselves. Some frequently used metacharacters are



^ $ * + ? . \ | ( ) [ ] { }

To match these characters, you put a backslash in front of them. For example, \\ matches a backslash and \$ matches a dollar sign. A backslash can also be used to turn an alphanumeric character into a metacharacter with special meaning. For example, \t matches a tab character, while \d matches any digit. Table 13.3 shows some of these commonly used backslash characters.

Table 13.3. Meaning of metacharacters

\n    Line feed
\r    Carriage return
\t    Tab
\v    Vertical tab
\f    Form feed
\d    A digit (same as [0-9])
\D    A non-digit (same as [^0-9])
\w    A word (alphanumeric) character (same as [a-zA-Z_0-9])
\W    A non-word character (same as [^a-zA-Z_0-9])
\s    A white space character (same as [ \t\v\n\r\f])
\S    A non-white space character (same as [^ \t\v\n\r\f])
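The backslash forms can be exercised with a few one-line tests. This is only a sketch: the variable names and sample strings are hypothetical, and the "^" and "$" anchors are added so that the first two patterns must match the whole string.

```javascript
var digitsOnly = /^\d+$/;  // one or more digits and nothing else
var wordOnly = /^\w+$/;    // one or more word characters and nothing else
var hasSpace = /\s/;       // any white space character
var hasBackslash = /\\/;   // a literal backslash
var hasDollar = /\$/;      // a literal dollar sign

console.log(digitsOnly.test("12345"));               // true
console.log(digitsOnly.test("123a5"));               // false
console.log(wordOnly.test("John_Smith9"));           // true
console.log(wordOnly.test("John-Smith"));            // false: "-" is not a word character
console.log(hasSpace.test("John Smith@pwt-ex.com")); // true
console.log(hasSpace.test("a\tb"));                  // true: a tab is white space
console.log(hasBackslash.test("C:\\temp"));          // true
console.log(hasDollar.test("cost: $5"));             // true
```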


Parentheses around a pattern, or a part of a pattern, cause the string's portion that is matched by that part to be remembered for later use. Consider the following expressions

/\d+/ and /(\d+)/

that will match as many digits as possible. However, in the latter case, the matched substring is remembered and can be referred to later in the same pattern through a backreference (back reference). A backreference is written as \#, where # is an integer identifying the group. As an example, the following pattern expression matches a string whose first two characters are also its last two characters, but in reverse order:



/^(.)(.).*\2\1$/

One popular use of this expression is to match strings such as "/**Comment**/". Note that paired parentheses are numbered by counting the opening parentheses from the left. The next section shows how to use these rules to match email addresses.
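The backreference pattern can be tried as follows. This is a small sketch with invented sample strings; it also uses the standard exec method to show what each pair of parentheses has remembered.

```javascript
var mirrorEnds = /^(.)(.).*\2\1$/; // first two characters reappear reversed at the end

console.log(mirrorEnds.test("/**Comment**/")); // true: starts "/*", ends "*/"
console.log(mirrorEnds.test("abcdba"));        // true: starts "ab", ends "ba"
console.log(mirrorEnds.test("abcdab"));        // false: ends "ab", not "ba"

// exec returns the whole match followed by the remembered groups.
var groups = mirrorEnds.exec("/**Comment**/");
console.log(groups[1]); // "/"  (group 1, referenced by \1)
console.log(groups[2]); // "*"  (group 2, referenced by \2)
```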

13.3.4 A constructive pattern to match an email address

The basic syntax of an email address can be described by the following rules. It normally consists of:

  • one or more normal characters [a-zA-Z_0-9] together with a combination of "_" (underscore), "." (period), and "-" (hyphen) before the "@" symbol;

  • a sequence of letters, numbers, and periods which are all valid domain or IP address characters; and

  • a period followed by a suffix of two to three letters or one to three digits at the end of the address.

In terms of matching patterns, the first rule can be represented by the expression



^\w+([\.-]?\w+)*@

That is, any email address should begin with a normal character [a-zA-Z_0-9] and then be followed by any combination of an optional single period or hyphen and more normal characters before the "@" symbol. The second rule is very similar to the first one and can be matched by the expression



\w+([\.-]?\w+)*

and finally the end part by



(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$

By combining all three expressions together, you will have a simple email syntax that can be used to determine a valid email address, i.e.,



/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$/
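To see that the combined expression really is the three parts placed end to end, the sketch below builds it from the parts with the RegExp constructor and the source property. The variable names and sample addresses are hypothetical; the three sub-patterns are the ones given above.

```javascript
var localPart = /^\w+([\.-]?\w+)*@/;                  // rule 1: before the "@"
var domainPart = /\w+([\.-]?\w+)*/;                   // rule 2: domain characters
var suffixPart = /(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$/; // rule 3: final suffix

// Concatenate the pattern texts and compile them as one expression.
var combined = new RegExp(localPart.source + domainPart.source + suffixPart.source);

// The same expression written as a single literal.
var literal = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$/;

var probe = ["JohnSmith@pwt-ex.com", "John.Smith@isp.co.uk", "JohnSmith@isp.u"];
probe.forEach(function (addr) {
  console.log(addr + ": " + combined.test(addr) + " / " + literal.test(addr));
});
// JohnSmith@pwt-ex.com: true / true
// John.Smith@isp.co.uk: true / true
// JohnSmith@isp.u: false / false
```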

Consider the following page:



Example: ex13-08.htm - Validate Email Addresses (II)

 1: <?xml version="1.0" encoding="iso-8859-1"?>
 2: <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
 3:     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
 4: <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
 5: <head><title>Validate Email Addresses (II) -- ex13-08.htm</title></head>
 6: <body style="font-family:arial;font-size:16pt;background:#000088;
 7:   color:#ffff00;text-align:center">
 8: <script>
 9: function isEmail(valSt)
10: {
11:  var patSt=/^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$/
12:  return( patSt.test(valSt))
13: }
14:
15:  emailAdd = new Array()
16:  emailAdd[1]="JohnSmith@pwt-ex.com"
17:  emailAdd[2]="John@Smith@pwt-ex.com"
18:  emailAdd[3]="JohnSmith@.pwt-ex.com"
19:  emailAdd[4]="JohnSmith@isp..com "
20:  emailAdd[5]=".JohnSmith@pwt-ex.com"
21:  emailAdd[6]="@pwt-ex.com"
22:  emailAdd[7]="John Smith@pwt-ex.com"
23:  emailAdd[8]="JohnSmith@.i sp.com"
24:  emailAdd[9]="JohnSmith@isp;com "
25:  emailAdd[10]="JohnSmith@isp,com"
26:  emailAdd[11]="JohnSmith@isp:com"
27:  emailAdd[12]="JohnSmith@isp.u"
28:  emailAdd[13]="JohnSmith@231.198.198.1"
29:
30: for (ii=1;ii<emailAdd.length;ii++)
31: {
32: document.write(emailAdd[ii]+" is likely a "+isEmail(emailAdd[ii])+
33:   " Email address<br />")
34: }
35:
36: </script>
37: </body>
38: </html>

Thanks to the constructive pattern in line 11, the isEmail() function is incredibly simple. The remaining part of this page is similar to that of example ex13-07.htm. The corresponding screen display of this page is shown in Fig. 13.16. As can be seen from Fig. 13.16, this page is capable of picking up not only the most commonly known malformed email addresses as in ex13-07.htm, but also the following common mistakes:

John Smith@pwt-ex.com    (A space between John and Smith)
JohnSmith@i sp.com       (A space in the domain name)
JohnSmith@isp;com        (A semi-colon instead of a period)
JohnSmith@isp,com        (A comma instead of a period)
JohnSmith@isp:com        (A colon instead of a period)
JohnSmith@pwt-ex.u       (Last part of domain name contains only one letter)


Figure 13.16. ex13-08.htm

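The results of ex13-08.htm can also be reproduced in a console. The following is a sketch under the assumption of a JavaScript console such as Node.js, with console.log in place of document.write; the variable name results08 is hypothetical, and the thirteen addresses are those of the example page.

```javascript
function isEmail(valSt) {
  var patSt = /^\w+([\.-]?\w+)*@\w+([\.-]?\w+)*(\.(([a-zA-Z]{2,3})|(\d{1,3})))+$/;
  return patSt.test(valSt);
}

var emailAdd = [
  "JohnSmith@pwt-ex.com",
  "John@Smith@pwt-ex.com",
  "JohnSmith@.pwt-ex.com",
  "JohnSmith@isp..com ",
  ".JohnSmith@pwt-ex.com",
  "@pwt-ex.com",
  "John Smith@pwt-ex.com",
  "JohnSmith@.i sp.com",
  "JohnSmith@isp;com ",
  "JohnSmith@isp,com",
  "JohnSmith@isp:com",
  "JohnSmith@isp.u",
  "JohnSmith@231.198.198.1"
];

var results08 = emailAdd.map(function (addr) { return isEmail(addr); });

emailAdd.forEach(function (addr, i) {
  console.log(addr + " is likely a " + results08[i] + " Email address");
});
// Only the first and the last address are reported as true.
```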


As an additional remark, an IP address enclosed in square brackets is also a valid host part of an email address. So the address JohnSmith@[231.198.198.1] refers to the same host as JohnSmith@231.198.198.1. To include the square brackets in the match pattern, you can use the following:



/^\w+([\.-]?\w+)*@(\[)?\w+([\.-]?\w+)*
      (\.(([a-zA-Z]{2,3})|(\d{1,3})(\])?))+$/

This is a practical expression and can be used in your application once it is written on a single line (it is split here only for display). Again, this is by no means a complete or a perfect expression to detect all malformed addresses. It is virtually impossible to come up with a matching expression, or expressions, that would include all human errors in email addresses.
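As an illustration of both the usefulness and the limits of such an expression, here is a sketch that writes the bracket-aware pattern on one line, assuming that the optional closing bracket is matched with \]. The variable name patBracket and the sample addresses are hypothetical.

```javascript
// Optional "[" after the "@" and optional "]" after a numeric suffix.
var patBracket = /^\w+([\.-]?\w+)*@(\[)?\w+([\.-]?\w+)*(\.(([a-zA-Z]{2,3})|(\d{1,3})(\])?))+$/;

console.log(patBracket.test("JohnSmith@[231.198.198.1]")); // true
console.log(patBracket.test("JohnSmith@231.198.198.1"));   // true
console.log(patBracket.test("JohnSmith@pwt-ex.com"));      // true

// Because each bracket is optional on its own, an unbalanced
// bracket is accepted as well.
console.log(patBracket.test("JohnSmith@[231.198.198.1"));  // true
```

The last line shows one of the gaps that a single expression leaves open: the two brackets are not required to appear together.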
