Regex’s | Regular Expressions in Programming

0
1933

Today I saw a nice script which redirects mobile traffic by searching the useragent string for a list of search terms related to mobile devices. This is PHP but I’ll show you how you can use this type of “regex” in Javascript too.

<?php
$useragent=$_SERVER[‘HTTP_USER_AGENT’];
if(preg_match(‘/^1207.*|^3gso.*|^4thp.*|^501i.*|^502i.*|^503i.*|^504i.*|^505i.*|^506i.*|.*Fennec.*|^6310.*|^6590.*|^770s.*|^802s.*|.*a100.*|.*a510.*|.*a511.*|^abac.*|^acer.*|^acoo.*|^acs.*|^aiko.*|^airn.*|.*alacatel.*|^alav.*|^alca.*|^alco.*|^amoi.*|^Amoi.*|.*android.*|^anex.*|^anny.*|^anyw.*|^aptu.*|^arch.*|^argo.*|^aste.*|^asus.*|^ASUS.*|^attw.*|^au.*|^audi.*|^Audiovox.*|^AU-MIC.*|^aur.*|^aus.*|^avan.*|^beck.*|^bell.*|^benq.*|^BenQ.*|^bilb.*|^bird.*|^Bird.*|^blac.*|.*BlackBerry.*|^blaz.*|.*Blazer.*|.*boxee.*|.*BRAVIA.*|^brew.*|^brvw.*|^bumb.*|^bw.*|^c55.*|^capi.*|^ccwa.*|^cdm.*|^CDM.*|.*CE-HTML.*|^cell.*|^chtm.*|^cldc.*|^cmd.*|^comp.*|^cond.*|.*CorePlayer.*|^craw.*|^dait.*|^dall.*|^dang.*|^dbte.*|^dc.*|.*dell streak.*|^devi.*|^dica.*|.*DLNA.*|.*DLNADOC.*|^dmob.*|^doco.*|^DoCoMo.*|^dopo.*|^dopod.*|^ds.*|^ds12.*|^el49.*|^elai.*|^eml2.*|^emul.*|^eric.*|.*Ericsson.*|^erk0.*|^esl8.*|^ez40.*|^ez60.*|^ez70.*|^ezos.*|^ezwa.*|^ezze.*|^fake.*|^fetc.*|^fly.*|.*FlyCast.*|.*foobar2000.*|^g1.*|^g560.*|^gene.*|^gf.*|^go.*|.*GomPlayer.*|^good.*|.*GoogleTV.*|^grad.*|^grun.*|^haie.*|^Haier.*|.*hbbtv.*|.*HbbTV.*|^hcit.*|^hd.*|^hei.*|^hipt.*|^hita.*|^HP.*|.*htc.*|^htca.*|^htcg.*|^htcp.*|^htcs.*|^htct.*|^http.*|^huaw.*|.*Huawei.*|^hutc.*|^i230.*|^iac.*|^ibro.*|^idea.*|.*iemobile.*|^ig01.*|^ikom.*|^im1k.*|^i-mobile.*|^inno.*|.*ipad.*|^ipaq.*|.*iPAQ.*|.*iphone.*|.*iPod.*|^iris.*|.*iTunes.*|^jata.*|^java.*|^jbro.*|^jemu.*|^jigs.*|^kddi.*|^KDDI.*|^keji.*|^kgt.*|.*kindle.*|^klon.*|^KONKA.*|^kpt.*|^kwc.*|^KWC.*|^kyoc.*|^kyok.*|.*Large Screen.*|^leno.*|^Lenovo.*|^lexi.*|^lg.*|^lg50.*|^lg54.*|^lge.*|^libw.*|^lynx.*|^m3ga.*|^m50.*|^mate.*|^maui.*|^maxo.*|^mc01.*|^mc21.*|^mcca.*|^medi.*|^merc.*|^meri.*|^midp.*|.*midp.*|.*mini.*|^mio8.*|^mioa.*|.*Miro.*|^mits.*|^mmef.*|^mo01.*|^mo02.*|^mobi.*|.*mobile.*|^mode.*|^modo.*|^mot.*|^motv.*|^mozz.*|.*MPlayer.*|.*MSN.*|^mt50.*|^mtp1.*|^mtv.*|^mwbp.*|^mywa.*|^n100.*|^n101.*|^n102.*|^n202.*|^n203.*|^n300.*|^n302.*|^n500.*|^n502.*|^n505.*|^n700.*|^n701.*|^n710.*|^nec.*|^NEC-.*|^nem.*|^neon.*|^netf.*|.*NETTV.*|^newg.*|^NEWGEN.*|^newt.*|.*Nexus 10.*|.*Nexus 7.*|.*Nintendo.*|^nok6.*|^noki.*|.*Nokia.*|.*Novarra.*|^nzph.*|^o2.*|.*o2.*|.*O2.*|^o2im.*|.*Opera.Mobi.*|^opti.*|^opwv.*|^oran.*|^owg1.*|^p800.*|.*Palm.*|^pana.*|^Panasonic.*|^pand.*|^pant.*|^PANTECH.*|^pdxg.*|^PG.*|^pg13.*|^phil.*|^Philips.*|^pire.*|^play.*|.*PLAYSTATION 3.*|.*Plex.*|^pluc.*|^pock.*|.*pocket.*|^port.*|^portalmmm.*|^pose.*|^PPC.*|^prox.*|.*PS3.*|^psio.*|.*psp.*|^qc07.*|^qc12.*|^qc21.*|^qc32.*|^qc60.*|^qci.*|^qtek.*|^Qtek.*|.*QuickTime.*|^qwap.*|^r380.*|^r600.*|^raks.*|^rim9.*|^rove.*|^rozo.*|^s55.*|^sage.*|^Sagem.*|^SAGEM.*|^sama.*|^samm.*|^sams.*|.*SAMSUNG.*|^sany.*|.*Sanyo.*|^sava.*|^sc01.*|^sch.*|^SCH.*|.*SCH-.*|.*sch-i800.*|^scoo.*|^scp.*|^sdk.*|^se47.*|^sec.*|^SEC.*|^sec0.*|^sec1.*|^semc.*|.*SEMC-Browser.*|^send.*|^Sendo.*|^seri.*|^sgh.*|^SGH.*|.*SGH-.*|.*sgh-t849.*|^shar.*|^Sharp.*|.*shw-m180s.*|^sie.*|^SIE.*|^siem.*|^SIEMENS.*|.*silk.*|^sl45.*|^slid.*|^smal.*|^smar.*|.*Smarthub.*|.*smartphone.*|.*SmartTV.*|.*SMART-TV.*|^smb3.*|^smit.*|^smt5.*|^soft.*|^SoftBank.*|^sony.*|^SonyEricsson|^SonyEricsson.*|.*SonyEricsson.*|^sp01.*|^sph.*|^SPH.*|^spv.*|^sy01.*|^symb.*|.*symbian.*|.*SymbianOS.*|^t218.*|^t250.*|^t600.*|^t610.*|^t618.*|.*tablet.*|^tagt.*|^talk.*|^tcl.*|^tdg.*|.*teleca.*|^teli.*|^telm.*|^tim.*|^topl.*|^tosh.*|.*Toshiba.*|.*treo.*|^ts70.*|^tsm.*|^tsm3.*|^tsm5.*|.*up\.browser.*|^upg1.*|.*up\.link.*|.*UPnP.*|^upsi.*|^UTS.*|^utst.*|^v400.*|^v750.*|^veri.*|^Vertu.*|^virg.*|^vite.*|^vk40.*|^vk50.*|^vk52.*|^vk53.*|.*VLC media player.*|^vm40.*|^voda.*|.*vodafone.*|^vulc.*|^vx52.*|^vx53.*|^vx60.*|^vx61.*|^vx70.*|^vx80.*|^vx81.*|^vx83.*|^vx85.*|^vx98.*|^w3c.*|.*WAFA.*|^wap.*|.*wap.*|^wapa.*|^wapi.*|^wapj.*|^wapm.*|^wapp.*|^wapr.*|^waps.*|^wapt.*|^wapu.*|^wapv.*|^wapy.*|^webc.*|.*webOS.*|.*WebTV.*|^whit.*|.*BOLT.*|^wig.*|.*wii.*|^winc.*|.*windows ce.*|.*Windows.CE.*|.*Windows-Media-Player.*|.*WindowsPhone.*|.*Windows Phone.*|^winw.*|^wmlb.*|^wonu.*|^x700.*|.*XBMC.*|.*xbox.*|^xda.*|.*Xda.*|^xda2.*|^xdag.*|^yas.*|^your.*|^zeto.*|^ZTE.*/i’, $useragent)) {
    header(‘Location: http://mobile.mysite.com’);
}
?>

So the code above will redirect mobile devices to the url mobile.mysite.com

The interesting line of code is the one with the long search term but if we cut that out and take a look you can see what it is doing.

So the regex is contained within two / / symbols the i after the final symbol means the search is case-insensitve. $useragent is the variable which we are searching for words1&2 which are split by a | symbol. The ^ symbol is used to anchor the search to the start so it will only match when the useragent is “word1 rest of string” and wont match “hello word1 rest of string”. The dot star at the end of a regex means that anything can come afterwards.

You can use regex’s in most programming languages. Here is a simple if condition in Javascript:

if(navigator.userAgent.match(/^word1.*|^word2.*/i)) {

Regexs can be used to build complex if statements, search and replace functions, data extraction etc.

regex

One of my favorites is:

$clean = preg_replace(“/[^a-zA-Z0-9]/”,””,$raw);

Which removes any bad characters from a user inputed string which is useful for indexing and filenames.

Or in a scraping situation you could extract all the links from a html document with the following PHP code

function extract_links($text) {
  preg_match_all(‘/<\s*a[^<>]*?href=[\'”]?([^\s<>\'”]*)[\'”]?[^<>]*>(.*?)<\/a>/si’,
    $text,
    $match_array,
    PREG_SET_ORDER);
  $return = array() ;
  foreach ($match_array as $serp) {
    $full_anchor = $serp[0];
    $href = $serp[1];
    $anchortext = $serp[2];
    $anchor_array = array($href,$anchortext) ;
    array_push($return,$anchor_array) ;
  }
  return $return ;
}

To find out more google regex tutorial or there is a cheat sheet here: http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

One word of note is that often regex performance isn’t great so if you are looking at doing something at really high volumes which is performance sensitive consider doing a more direct search.

LEAVE A REPLY

Please enter your comment!
Please enter your name here