Mastering Regular Expressions: Study Notes
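I. Introduction to Regular Expressions
In computer science, a regular expression is a single string that describes or matches a set of strings conforming to some syntactic rule.
In many text editors and other tools, regular expressions are used to search for and/or replace text that matches a pattern.
Many programming languages support string manipulation with regular expressions.
Perl, for example, has a powerful regular expression engine built in.
The concept was first popularized by Unix tools such as sed and grep.
"Regular expression" is usually abbreviated "regex"; the singular forms are regexp and regex, and the plural forms are regexps, regexes, and regexen.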
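II. History and Origins of Regular Expressions
The ancestry of regular expressions can perhaps be traced back to early research on how the human nervous system works.
Warren McCulloch, from New Jersey, and Walter Pitts, born in Detroit, were two scientists working in neurophysiology who developed a new way to describe neural networks mathematically. They modeled the neurons of the nervous system as small, simple automata, which was a major innovation.
In 1956 Stephen Kleene, a mathematician born in Hartford (the city Mark Twain called "one of the most beautiful cities in America"), built on the earlier work of McCulloch and Pitts and published a paper titled "Representation of Events in Nerve Nets and Finite Automata". It described their model with a mathematical notation he called regular sets and introduced the concept of regular expressions.
Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression".
Some time later, people found that this work could be applied in other areas.
Ken Thompson, the principal inventor of Unix and widely known as the father of Unix, applied it to some early research on computational search algorithms.
He introduced the notation into the editor QED, then into the Unix editor ed, and finally into grep.
Jeffrey Friedl explains this further in his book "Mastering Regular Expressions (2nd edition)"; it is recommended reading if you want to know more about the theory and history of regular expressions.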
Regular Expressions Explained in English
Six sample essays follow for the reader's reference.

Essay 1
Title: Cracking the Code: Unraveling the Magic of Regular Expressions

Have you ever wondered how computers can quickly find specific words or patterns within huge chunks of text? Well, get ready to dive into the fascinating world of regular expressions! These little codes are like secret spells that help computers search, match, and manipulate text with incredible speed and accuracy.

Regular expressions, often abbreviated as "regex" or "regexp," are a special language that computers understand. They're like a set of instructions that tell the computer exactly what to look for in a piece of text. Imagine you have a massive book, and you want to find all the sentences that mention a certain word or phrase. Instead of reading through the entire book word by word, you could use a regular expression to tell the computer precisely what to search for, and it would do the job in a flash!

Now, let's break down how regular expressions work. Think of them as a series of characters, each with its own special meaning. For example, the character "a" in a regular expression would match the letter "a" in the text you're searching. But regular expressions go beyond simple letters; they can also match numbers, symbols, and even patterns of characters.

One of the coolest things about regular expressions is that they can use special characters called "metacharacters" to match different types of patterns. For instance, the metacharacter "." can match any single character except a newline. So, if you wanted to find all the three-letter words that start with "c", you could test each word against the regular expression "c.." to find words like "cat," "car," or "cup."

Regular expressions can also use repetition characters to match patterns that repeat a certain number of times. The "+" character matches one or more occurrences of the preceding pattern, while the "*" character matches zero or more occurrences.
For example, the regular expression "a+b" would match "ab," "aab," "aaab," and so on, as long as there's at least one "a" before the "b."

But wait, there's more! Regular expressions can also use character classes to match specific sets of characters. For instance, the character class "[0-9]" would match any digit from 0 to 9, while "[a-z]" would match any lowercase letter from a to z.

Now, you might be thinking, "This all sounds great, but how do I actually use regular expressions?" Well, many programming languages and text editors support regular expressions, and you can often find them in search and replace functions, too. For example, if you wanted to find all the email addresses in a document, you could use a regular expression like "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b" to match patterns that look like valid email addresses.

Regular expressions can be incredibly powerful, but they can also be a bit tricky to learn at first. It's like learning a new language, but once you get the hang of it, you'll be able to perform all sorts of amazing text manipulation feats!

So, the next time you're working with text, whether it's in a programming language, a text editor, or even a search engine, keep an eye out for regular expressions. They're like little wizards that can help you find, match, and manipulate text in ways you never thought possible. Who knows, you might even become a regular expression master yourself one day!

Essay 2
Title: What are Regular Expressions? A Fun Explanation!

Have you ever played a game where you had to find hidden words or patterns? Well, regular expressions are like that, but for computers! They are special codes that help computers search for and identify specific patterns in text. Pretty cool, right?

Regular expressions might sound complicated, but they are actually a lot of fun once you get the hang of them. Imagine you have a big pile of books, and you need to find all the books that have the word "cat" in the title.
You could look through each book one by one, but that would take forever! With regular expressions, you can tell the computer to look for the pattern "cat" and it will quickly find all the books that match.

But regular expressions can do much more than just find simple words. They can look for all sorts of patterns, like numbers, dates, email addresses, and even complicated combinations of letters and symbols. It's like having a super-powered search tool that can find almost anything you want!

Now, let's talk about how regular expressions work. They use a special language made up of different characters and symbols. Each symbol has a specific meaning, and when you put them together in a certain order, it creates a pattern that the computer can understand.

Here are some of the most common symbols used in regular expressions:

- The dot (.): This symbol matches any single character, except for a new line. So, if you search for ".at", it will match words like "cat", "bat", "hat", and even "1at" or "?at".
- The asterisk (*): This symbol matches the previous character or pattern zero or more times. For example, if you search for "a*b", it will match "b", "ab", "aab", "aaab", and so on.
- The plus sign (+): This symbol matches the previous character or pattern one or more times. So, if you search for "a+b", it will match "ab", "aab", "aaab", but not "b".
- Square brackets ([]): These brackets allow you to specify a set of characters that you want to match. For instance, if you search for "[bc]at", it will match "bat" and "cat".
- The caret (^): When used as the first character inside square brackets, this symbol negates the set of characters. So, if you search for "[^a]at", it will match "bat", "cat", "1at", but not "aat".
- The dollar sign ($): This symbol matches the end of a line or string. If you search for "cat$", it will match "cat" at the end of a line, but not "catch" or "cats".

Those are just a few examples of the symbols used in regular expressions.
There are many more, and you can combine them in countless ways to create incredibly complex patterns.

Now, you might be thinking, "But why would I need to use regular expressions? Can't I just search for words normally?" Well, regular expressions are incredibly useful for all sorts of tasks, like:

- Validating user input in forms (making sure an email address is formatted correctly, for example)
- Finding and replacing text in documents or code
- Extracting specific information from large data sets
- Searching for patterns in log files or other system data

Regular expressions are used everywhere, from websites and apps to programming languages and databases. They are a powerful tool that can save you a lot of time and effort when working with text.

Learning regular expressions might seem a bit tricky at first, but don't worry! It's like learning a new language: the more you practice, the easier it gets. And once you've mastered the basics, you'll be amazed at how much you can do with just a few symbols and characters.

So, the next time you're playing a word game or trying to find a specific pattern, remember that regular expressions are like secret codes that can help you unlock all sorts of hidden treasures in text. Happy pattern-hunting!

Essay 3
Regular Expressions: A Fun Way to Find Patterns in Words and Numbers!

Have you ever played word games like trying to find all the words that start with 'S' or end with 'ing'? Regular expressions are like super-powered word games that let you search for really cool patterns in words, numbers, and even weird codes called text! They're kind of like secret codes that only computers understand. But don't worry, I'll teach you all about them in a fun way!

Imagine you have a bunch of words written down, and you want to find all the words that start with 're'. You could go through the list one by one, looking at the first two letters of each word. That would take forever!
With regular expressions, you can tell the computer, "Hey, look for any word that starts with 're'," and it will find them all for you in a flash. Isn't that awesome?

Regular expressions use special symbols and letters to describe the patterns you want to find. For example, the regular expression /re\w*/ (don't worry about that slash for now) says, "Find 're' followed by zero or more word characters." The \w* part means "some word characters."

But regular expressions can do way more than just find words that start with certain letters. You can look for numbers, too! The regular expression /\d\d\d/ means "Find three digits in a row." The \d symbol stands for "a digit, like 0 through 9."

You can also look for words that contain certain patterns in the middle or end. The regular expression /ing\b/ means "Find any word that ends with 'ing'." The \b symbol stands for "a word boundary," which here marks the end of a word.

Here are some more fun examples of regular expressions:

- /^super/ finds text that starts with "super"
- /!$/ finds things that end with an exclamation point
- /\d\d\/\d\d\/\d\d\d\d/ finds dates in the format "MM/DD/YYYY"
- /\w*@\w*\.\w*/ finds email addresses

Isn't that wild? With just a few symbols, you can find all sorts of crazy patterns!

But wait, there's more! You can also use regular expressions to replace parts of words or sentences with new text. Let's say you have a sentence that says "I really really really like ice cream." You could use the regular expression /really/g to find all the "really" words, and replace them with "totally" to get "I totally totally totally like ice cream."
The /g part means "global," so it replaces all matches, not just the first one.

Regular expressions are super handy for finding and fixing mistakes in long documents, extracting information from messy data, and all sorts of other tasks that would take forever to do by hand.

I know regular expressions look a bit weird and confusing at first, but once you start playing around with them, they're really fun! You can think of them like secret codes that let you talk to computers in a special language. Pretty cool, right?

So next time you're looking for words or patterns, don't go through them one-by-one. Use the power of regular expressions to save time and make your searches way more awesome! Who knows, you might even discover some hidden patterns that no one else has noticed before. Regular expressions give you x-ray vision for words and numbers. How awesome is that?

Essay 4
Regular Expressions Explained for Kids

Have you ever played a game where you had to find certain words or patterns in a bunch of letters or numbers? That's kind of what regular expressions are all about! Regular expressions are like secret codes that help computers find specific patterns in text.

Imagine you're playing a game with your friends, and you're trying to find all the words that start with the letter "S". You could look through the whole list of words, one by one, and check if each word starts with "S". But that would take forever, right? That's where regular expressions come in handy!

A regular expression is like a special set of instructions that tells the computer exactly what to look for. In our game, the regular expression to find words starting with "S" would be: ^S

The little hat symbol "^" means "start of the text being checked" (here, each word), and the "S" is the letter we're looking for at the start of the word. So this regular expression tells the computer, "Find all the words that start with the letter 'S'."

Now, let's say we want to find all the words that end with "ing".
The regular expression for that would be: ing$

The dollar sign "$" means "end of the word", and "ing" is the part we're looking for at the end of the word.

Cool, right? But regular expressions can do even more tricks!

What if we want to find all the words that have exactly three letters? The regular expression for that would be: ^...$

The dots "." are placeholders that can represent any character. So "^...$" means "a word that starts with any character, followed by any two other characters, and then the end of the word."

You can also use regular expressions to find words that contain certain patterns. For example, if we want to find all the words that have the letter "a" followed by the letter "b", the regular expression would be: ab

Simple, isn't it?

Here's another example: let's say we want to find all the words that have either "cat" or "dog" in them. The regular expression for that would be: cat|dog

The vertical bar "|" means "or", so this regular expression tells the computer to look for words that contain either "cat" or "dog".

Regular expressions can get pretty complicated, but even simple ones like these can be super useful for finding patterns in text. Programmers use them all the time to search through large amounts of data, like when they're looking for specific words or phrases in a big document or a bunch of websites.

You can think of regular expressions as a special language that helps computers understand what we're looking for. Just like you learn different languages to communicate with people from other countries, computers need regular expressions to understand the patterns we want them to find.

And the best part? Regular expressions work in much the same way in most programming languages (the details differ a little from one to the next), so once you learn them, you can use them pretty much anywhere!

So next time you're playing a game that involves finding words or patterns, remember regular expressions.
They're like secret codes that can make the game a whole lot easier, and maybe even help you win!

Essay 5
Regular Expressions: The Super Cool Pattern Finders!

Hey there, fellow kids! Have you ever played those fun word games where you have to find hidden words or patterns? Well, let me tell you about something even cooler: regular expressions! Okay, I know the name might sound a bit boring, but trust me, these things are like magic wands for text.

Imagine you have a bunch of text, like a really long story or a list of names and addresses. Sometimes, you might need to find specific patterns or pieces of information in all that text. That's where regular expressions come in! They're like special codes that help you search for and manipulate text in amazing ways.

Let's start with a simple example. Say you're looking for all the words in a story that start with the letter "s". With regular expressions, you can create a pattern like this: /\bs\w*/. That might look like gibberish, but it's actually a secret code that tells the computer to find all the words that start with "s". The two "/" symbols are just markers to tell the computer where the pattern starts and ends. The \b means "word boundary," so it looks for words that start with "s" rather than letters in the middle of words. The \w* part means "match any word character (letters, numbers, or underscores) zero or more times."

Cool, right? But that's just the tip of the iceberg! Regular expressions can do all sorts of awesome things. You can use them to find email addresses, phone numbers, dates, and even more complex patterns. It's like having a super-powered search tool!

Here's another example: let's say you want to find all the email addresses in a list of contacts. You could use a regular expression like this: /\b\w+@\w+\.\w+\b/. This pattern looks for some word characters, then an "@" symbol, then more word characters, a period, and then even more word characters. Voila!
You've got all the email addresses!

Now, I know what you're thinking: "But wait, those regular expression things look so complicated!" And you're right, they can get pretty crazy. But don't worry, you don't have to memorize all the symbols and patterns. There are lots of handy tools and cheatsheets that can help you build regular expressions, and with a bit of practice, you'll be a pro in no time!

One of the coolest things about regular expressions is that you can use them in all sorts of programming languages and text editors. Whether you're writing code, searching through files, or even just trying to find and replace text in a document, regular expressions can be your best friend.

And here's a little secret: regular expressions are like a secret code that can make you look like a total genius to your friends and teachers. Imagine being able to find all the words in a book that have more than three vowels, or all the phone numbers in a directory that start with a specific area code. With regular expressions, you can do all that and more!

So, what do you say? Are you ready to become a regular expression master and impress everyone with your text-manipulating superpowers? It might take a little practice, but trust me, it's totally worth it. Who knows, you might even end up creating the next big word game or text-based app using these awesome pattern-finding tools!

Essay 6
What are Regular Expressions?

Have you ever played a game where you had to find hidden words or patterns? Regular expressions are like that, but for computers! They are special codes that help computers search for specific patterns in text.

Regular expressions might seem complicated at first, but they are actually quite fun once you get the hang of them. Think of them as a secret language that you can use to tell the computer exactly what to look for.

Let's start with an example. Imagine you want to find all the words in a sentence that start with the letter "c". You could use the regular expression "/c\w*/".
This code tells the computer to look for the letter "c" with zero or more word characters following it.

Breaking it down:

- The "/" and "/" symbols are used to mark the start and end of the regular expression.
- The "c" tells the computer to look for the letter "c".
- The "\w*" part means "match any word character (letters, numbers, or underscores) zero or more times".

So, if you had the sentence "Cats and dogs chase mice," the regular expression "/c\w*/" would match "chase" and also the "ce" at the end of "mice", because nothing ties the "c" to the start of a word. It would not match "Cats", because the pattern is case-sensitive. Adding a word boundary and the case-insensitive flag, as in "/\bc\w*/i", matches exactly "Cats" and "chase".

Building Regular Expressions

Regular expressions are like little puzzles made up of different pieces. Each piece represents a different pattern or rule that the computer should look for. Here are some common pieces:

- Literal Characters: These are just regular letters or numbers that the computer should look for exactly as they appear. For example, "cat" will match the word "cat" but not "Cat" or "CAT".
- Character Classes: These are special codes that represent groups of characters. For example, "\d" matches any digit (0-9), and "\w" matches any word character (letters, numbers, or underscores).
- Quantifiers: These tell the computer how many times a pattern should repeat. For example, "a+" matches one or more occurrences of the letter "a", and "b?" matches zero or one occurrence of the letter "b".
- Anchors: These help specify the position of the pattern within the text. For example, "^start" matches any text that starts with the word "start", and "end$" matches any text that ends with the word "end".

You can combine these pieces in endless ways to create complex patterns. It's like building with Lego blocks, but instead of making spaceships or castles, you're creating instructions for the computer to follow.

Using Regular Expressions

Regular expressions are incredibly useful for all sorts of tasks.
For example, you can use them to:

- Find and replace text in a document
- Validate user input (like making sure an email address is correctly formatted)
- Extract specific information from large data sets
- Search for patterns in code or log files

Many programming languages and tools support regular expressions, so you can use them in all kinds of applications and scripts.

Learning regular expressions might seem tricky at first, but the more you practice, the easier it becomes. It's like learning a new language or solving puzzles: the more you do it, the better you get!

So, the next time you need to find a hidden pattern or perform a complex search, remember regular expressions. They are a powerful tool that can make your life (and the computer's life) much easier.
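The symbols described in the essays above can be tried out directly. The following is a minimal sketch in Python using the standard `re` module; the word lists and variable names are made up for illustration, and `re.fullmatch` is used so that each pattern has to cover a whole word.

```python
import re

words = ["cat", "car", "cup", "catch", "bat", "b", "ab", "aab"]

# "." stands for any single character except a newline.
three_letter_c = [w for w in words if re.fullmatch(r"c..", w)]

# "*" allows zero or more repetitions; "+" requires at least one.
star = [w for w in words if re.fullmatch(r"a*b", w)]
plus = [w for w in words if re.fullmatch(r"a+b", w)]

# "[^a]" matches any single character except "a".
not_a = [w for w in ["bat", "cat", "1at", "aat"] if re.fullmatch(r"[^a]at", w)]

# "$" anchors the match at the end of the string.
ends_with_cat = [s for s in ["cat", "catch", "cats", "tomcat"]
                 if re.search(r"cat$", s)]

# "|" means "either the left side or the right side".
pets = re.findall(r"cat|dog", "a cat chased a dog")

print(three_letter_c)  # ['cat', 'car', 'cup']
print(star)            # ['b', 'ab', 'aab']
print(plus)            # ['ab', 'aab']
print(not_a)           # ['bat', 'cat', '1at']
print(ends_with_cat)   # ['cat', 'tomcat']
print(pets)            # ['cat', 'dog']
```

With `re.search` instead of `re.fullmatch`, a pattern such as "c.." would also be found inside longer words like "catch", which is why the essays' "exactly three letters" wording needs anchors or a whole-word test.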
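Regular Expressions: Study Summary and Key Points
A regular expression (RE) uses one string to define a pattern, which is then applied to another string to find out whether it contains text that matches the pattern.
Principles of use, and what you need to know:
1. A clear understanding of the target text is the basic prerequisite for using REs correctly. In other words, extracting data from text of a known form is a completely different matter from extracting data from arbitrary text.
2. If an RE is very unlikely to match anything unwanted, using it is reasonable.
3. How complex an RE needs to be depends on how precise a result you want; the most suitable solution depends on the precision, efficiency, and tolerance for error that you can accept.
4. The rule of balance. A good RE must strike a balance among the following:
   1. Match only the expected text and exclude the unexpected.
   2. Be easy to control and understand.
   3. Be efficient: if a match is possible, return it quickly; if not, report failure in as little time as possible.
5. Do not forget to think regularly about the cases where the match fails.
6. Verifying that a pattern produces the expected matches is not hard; verifying that it will not match things you do not want is much less simple. That is, it is often very difficult to think through every case that must not match and to make sure all of them are excluded.
7. Do not forget the "special" cases: for "bad" data, the RE should not be able to match.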
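Principles 5 to 7 above suggest checking a pattern against inputs that must fail as well as inputs that must succeed. The sketch below shows one way to do that in Python; the 24-hour time pattern and both test lists are hypothetical examples, not taken from the notes.

```python
import re

# Hypothetical pattern: a time of day in 24-hour HH:MM form.
TIME = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")

should_match = ["00:00", "09:30", "23:59"]
# "Bad" data that the pattern must reject.
should_not_match = ["24:00", "12:60", "9:30", "12:345", ""]

# fullmatch() requires the whole string to fit the pattern,
# so trailing garbage such as "12:345" is rejected.
accepted = [t for t in should_match if TIME.fullmatch(t)]
wrongly_accepted = [t for t in should_not_match if TIME.fullmatch(t)]

print(accepted)          # ['00:00', '09:30', '23:59']
print(wrongly_accepted)  # []
```

Keeping the negative cases in a list next to the positive ones makes it cheap to add a new "bad" input whenever one turns up in real data.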
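Engine Construction and Basic Mechanics
How an engine handles combinations of characters, metacharacters, character classes, quantifiers, and parentheses defines its characteristics.
There are two types of engine: the text-directed DFA engine and the regex-directed NFA engine.
The full names are deterministic finite automaton and nondeterministic finite automaton.
NFA engines are further divided into traditional NFA and POSIX NFA.
These notes cover the traditional NFA engine.
The reason is that some very useful features, such as backtracking, capturing parentheses, lookaround, lazy quantifiers, backreferences, possessive quantifiers, and atomic grouping, can only be implemented in a regex-directed engine.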
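Several of the features listed above can be observed in Python, whose `re` module is a backtracking, regex-directed engine. This is a small sketch with made-up sample strings; possessive quantifiers and atomic groups are left out because older Python versions do not support them.

```python
import re

html = "<b>bold</b> and <i>italic</i>"

# Greedy quantifier: ".+" grabs as much as it can, then backtracks
# only as far as needed to find the final ">".
greedy = re.search(r"<.+>", html).group()

# Lazy quantifier: ".+?" grabs as little as it can.
lazy = re.search(r"<.+?>", html).group()

# Capturing parentheses plus a backreference: \1 must repeat
# exactly the text captured by group 1.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# Lookahead: (?=\() checks that "(" follows without consuming it.
called = re.findall(r"\w+(?=\()", "print(len(x))")

print(greedy)   # <b>bold</b> and <i>italic</i>
print(lazy)     # <b>
print(doubled)  # is
print(called)   # ['print', 'len']
```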
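Chapter 1: A Regular Expression Matcher
Brian Kernighan
Regular expressions are a notation for describing patterns of text; in effect they form a special-purpose language for pattern matching.
Although there are many variants, they all share one idea: most characters in a pattern match the same characters in the string, but a few metacharacters have special meaning. For example, * indicates some kind of repetition, and [...] means any one character from the set inside the brackets.
In practice, most searches in programs such as text editors are for literal text, so the regular expression is usually a plain string like "print", which matches "printf", "sprintf", or "printer paper" anywhere in the document.
In Unix and Windows, so-called wildcards can be used to specify filenames; the character * matches any number of characters, so the pattern *.c matches every filename that ends in .c.
There are many, many flavors of regular expressions, even in contexts where one would expect them to be the same.
Jeffrey Friedl's book "Mastering Regular Expressions" is an exhaustive study of the topic.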
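The two kinds of search mentioned above, a literal pattern such as "print" and a filename wildcard such as *.c, can be compared in a short Python sketch. The sample lines and filenames are invented; `fnmatch.fnmatchcase` from the standard library does the wildcard matching, and the last line shows a regular expression that is assumed here to be equivalent to the wildcard for these simple names.

```python
import fnmatch
import re

lines = ["printf(fmt);", "sprintf(buf, fmt);", "printer paper", "puts(s);"]

# A literal pattern matches wherever the text occurs inside a line.
hits = [line for line in lines if re.search(r"print", line)]

files = ["main.c", "util.c", "main.h", "notes.txt"]

# In a wildcard, "*" alone means "any number of characters".
c_files = [name for name in files if fnmatch.fnmatchcase(name, "*.c")]

# In a regular expression, "*" repeats the item before it, so the
# same test is written ".*" followed by an escaped dot and "c".
c_files_re = [name for name in files if re.fullmatch(r".*\.c", name)]

print(hits)        # ['printf(fmt);', 'sprintf(buf, fmt);', 'printer paper']
print(c_files)     # ['main.c', 'util.c']
print(c_files_re)  # ['main.c', 'util.c']
```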
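Stephen Kleene invented regular expressions in the mid-1950s as a notation for finite automata; in fact, regular expressions are equivalent to the finite automata they represent.
Regular expressions first appeared in a program setting in Ken Thompson's version of the QED text editor in the mid-1960s.
In 1967 Thompson applied for a patent on a mechanism for rapid text matching based on regular expressions.
The patent was granted in 1971, making it one of the very first software patents [U.S. Patent 3,568,156, Text Matching Algorithm, March 2, 1971]. Regular expressions later moved from QED to the Unix editor ed, and then to the classic Unix tool grep, which Thompson created by performing radical surgery on ed.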
Regular Expressions: Usage Notes

Regular expressions are a powerful language for matching text patterns. This page gives a basic introduction to regular expressions themselves, sufficient for our Python exercises, and shows how regular expressions work in Python. The Python "re" module provides regular expression support.

In Python a regular expression search is typically written as:

    match = re.search(pat, text)

The re.search() method takes a regular expression pattern and a string and searches for that pattern within the string. If the search is successful, search() returns a match object; otherwise it returns None. Therefore, the search is usually immediately followed by an if-statement to test if the search succeeded, as shown in the following example, which searches for the pattern 'word:' followed by a 3 letter word (details below):

    import re

    text = 'an example word:cat!!'
    match = re.search(r'word:\w\w\w', text)
    # If-statement after search() tests if it succeeded
    if match:
        print('found', match.group())  ## 'found word:cat'
    else:
        print('did not find')

The code match = re.search(pat, text) stores the search result in a variable named "match". Then the if-statement tests the match -- if true the search succeeded and match.group() is the matching text (e.g. 'word:cat'). Otherwise if the match is false (None to be more specific), then the search did not succeed, and there is no matching text.

The 'r' at the start of the pattern string designates a Python "raw" string which passes through backslashes without change, which is very handy for regular expressions (Java needs this feature badly!). I recommend that you always write pattern strings with the 'r' just as a habit.

Note: match.group() returns the matched text as a string (type: str).

Basic Patterns

The power of regular expressions is that they can specify patterns, not just fixed characters. Here are the most basic patterns which match single chars:

- a, X, 9, < -- ordinary characters just match themselves exactly. The meta-characters which do not match themselves because they have special meanings are: . ^ $ * + ? { [ ] \ | ( )
- . (a period) -- matches any single character except newline '\n'
- \w -- (lowercase w) matches a "word" character: a letter or digit or underbar [a-zA-Z0-9_]. Note that although "word" is the mnemonic for this, it only matches a single word char, not a whole word. \W (upper case W) matches any non-word character.
- \b -- boundary between word and non-word
- \s -- (lowercase s) matches a single whitespace character -- space, newline, return, tab, form feed [ \n\r\t\f]. \S (upper case S) matches any non-whitespace character.
- \t, \n, \r -- tab, newline, return
- \d -- decimal digit [0-9]
- ^ = start, $ = end -- match the start or end of the string
- \ -- inhibit the "specialness" of a character. So, for example, use \. to match a period or \\ to match a backslash. If you are unsure if a character has special meaning, such as '@', you can put a backslash in front of it, \@, to make sure it is treated just as a character.

Basic Features

The basic rules of regular expression search for a pattern within a string are:

- The search proceeds through the string from start to end, stopping at the first match found
- All of the pattern must be matched, but not all of the string
- If match = re.search(pat, text) is successful, match is not None and in particular match.group() is the matching text

Repetition

Things get more interesting when you use + and * to specify repetition in the pattern.

- + -- 1 or more occurrences of the pattern to its left, e.g. 'i+' = one or more i's
- * -- 0 or more occurrences of the pattern to its left
- ? -- match 0 or 1 occurrences of the pattern to its left

Leftmost & Largest

First the search finds the leftmost match for the pattern, and second it tries to use up as much of the string as possible -- i.e. + and * go as far as possible (the + and * are said to be "greedy").

    ## i+ = one or more i's, as many as possible.
    match = re.search(r'pi+', 'piiig')  # found, match.group() == "piii"

    ## Finds the first/leftmost solution, and within it drives the +
    ## as far as possible (aka 'leftmost and largest').
    ## In this example, note that it does not get to the second set of i's.
    match = re.search(r'i+', 'piigiiii')  # found, match.group() == "ii"

    ## \s* = zero or more whitespace chars
    ## Here look for 3 digits, possibly separated by whitespace.
    match = re.search(r'\d\s*\d\s*\d', 'xx1 2 3xx')  # found, match.group() == "1 2 3"
    match = re.search(r'\d\s*\d\s*\d', 'xx12 3xx')   # found, match.group() == "12 3"
    match = re.search(r'\d\s*\d\s*\d', 'xx123xx')    # found, match.group() == "123"

    ## ^ = matches the start of string, so this fails:
    match = re.search(r'^b\w+', 'foobar')  # not found, match is None
    ## but without the ^ it succeeds:
    match = re.search(r'b\w+', 'foobar')   # found, match.group() == "bar"

Emails Example

Suppose you want to find the email address inside the string 'xyz alice-b@google.com purple monkey'. We'll use this as a running example to demonstrate more regular expression features. Here's an attempt using the pattern r'\w+@\w+':

    text = 'purple alice-b@google.com monkey dishwasher'
    match = re.search(r'\w+@\w+', text)
    if match:
        print(match.group())  ## 'b@google'

The search does not get the whole email address in this case because the \w does not match the '-' or '.' in the address. We'll fix this using the regular expression features below.

Square Brackets

Square brackets can be used to indicate a set of chars, so [abc] matches 'a' or 'b' or 'c'. The codes \w, \s etc. work inside square brackets too, with the one exception that dot (.) just means a literal dot. For the emails problem, the square brackets are an easy way to add '.' and '-' to the set of chars which can appear around the @ with the pattern r'[\w.-]+@[\w.-]+' to get the whole email address:

    match = re.search(r'[\w.-]+@[\w.-]+', text)
    if match:
        print(match.group())  ## 'alice-b@google.com'

You can also use a dash to indicate a range:

1. [a-z] matches all lowercase letters.
2. To use a dash without indicating a range, put the dash last, e.g. [abc-].
3. An up-hat (^) at the start of a square-bracket set inverts it, so [^ab] means any char except 'a' or 'b'.

Group Extraction

The "group" feature of a regular expression allows you to pick out parts of the matching text. Suppose for the emails problem that we want to extract the username and host separately. To do this, add parentheses ( ) around the username and host in the pattern, like this: r'([\w.-]+)@([\w.-]+)'. In this case, the parentheses do not change what the pattern will match; instead they establish logical "groups" inside of the match text. On a successful search, match.group(1) is the match text corresponding to the 1st left parenthesis, and match.group(2) is the text corresponding to the 2nd left parenthesis. The plain match.group() is still the whole match text as usual.

    text = 'purple alice-b@google.com monkey dishwasher'
    match = re.search(r'([\w.-]+)@([\w.-]+)', text)
    if match:
        print(match.group())   ## 'alice-b@google.com' (the whole match)
        print(match.group(1))  ## 'alice-b' (the username, group 1)
        print(match.group(2))  ## 'google.com' (the host, group 2)

A common workflow with regular expressions is that you write a pattern for the thing you are looking for, adding parenthesis groups to extract the parts you want.

findall

findall() is probably the single most powerful function in the re module. Above we used re.search() to find the first match for a pattern. findall() finds all the matches and returns them as a list of strings, with each string representing one match.

    ## Suppose we have a text with many email addresses
    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'

    ## Here re.findall() returns a list of all the found email strings
    emails = re.findall(r'[\w\.-]+@[\w\.-]+', text)  ## ['alice@google.com', 'bob@abc.com']
    for email in emails:
        # do something with each found email string
        print(email)

findall With Files

For files, you may be in the habit of writing a loop to iterate over the lines of the file, and you could then call findall() on each line. Instead, let findall() do the iteration for you -- much better! Just feed the whole file text into findall() and let it return a list of all the matches in a single step (recall that f.read() returns the whole text of a file in a single string):

    # Open the file; the with-block closes it again when done
    with open('test.txt', 'r') as f:
        # Feed the file text into findall(); it returns a list of all the found strings
        strings = re.findall(r'some pattern', f.read())

findall and Groups

The parenthesis ( ) group mechanism can be combined with findall(). If the pattern includes 2 or more parenthesis groups, then instead of returning a list of strings, findall() returns a list of tuples. Each tuple represents one match of the pattern, and inside the tuple is the group(1), group(2) .. data. So if 2 parenthesis groups are added to the email pattern, then findall() returns a list of tuples, each of length 2, containing the username and host, e.g. ('alice', 'google.com').

    text = 'purple alice@google.com, blah monkey bob@abc.com blah dishwasher'
    pairs = re.findall(r'([\w\.-]+)@([\w\.-]+)', text)
    print(pairs)  ## [('alice', 'google.com'), ('bob', 'abc.com')]
    for pair in pairs:
        print(pair[0])  ## username
        print(pair[1])  ## host

Once you have the list of tuples, you can loop over it to do some computation for each tuple. If the pattern includes no parentheses, then findall() returns a list of found strings as in earlier examples. If the pattern includes a single set of parentheses, then findall() returns a list of strings corresponding to that single group.

Obscure optional feature: Sometimes you have paren ( ) groupings in the pattern, but which you do not want to extract. In that case, write the parens with a ?: at the start, e.g. (?: ) and that left paren will not count as a group result.
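The interaction between findall() and groups described above, including the (?: ) form, can be seen side by side in a short sketch. The sample text and its example.com / example.org addresses are placeholders chosen for this illustration.

```python
import re

# Placeholder addresses, chosen only for this illustration.
text = "purple alice@example.com, blah monkey bob@example.org blah dishwasher"

# Two capturing groups: findall() returns a list of (username, host) tuples.
pairs = re.findall(r"([\w.-]+)@([\w.-]+)", text)

# One capturing group, host made non-capturing with (?: ):
# findall() returns just the strings for that single group.
users = re.findall(r"([\w.-]+)@(?:[\w.-]+)", text)

# No capturing groups at all: findall() returns the whole matches.
whole = re.findall(r"(?:[\w.-]+)@(?:[\w.-]+)", text)

print(pairs)  # [('alice', 'example.com'), ('bob', 'example.org')]
print(users)  # ['alice', 'bob']
print(whole)  # ['alice@example.com', 'bob@example.org']
```

The non-capturing form is useful when parentheses are needed only for grouping, for instance around an alternation, and the result should stay a flat list of strings.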
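Ranges in Regular Expressions

A regular expression is a powerful tool for matching, searching, and replacing strings. Within a regular expression, ranges can be used to define sets of characters so that the target string is matched more precisely. Ranges are written inside a character class: a character class is enclosed in square brackets [] and lists the characters to match, and a hyphen - between two characters denotes a range.

Some common forms and their meanings:

1. Digit range: [0-9] matches any single digit. For example, [0-9]+ matches one or more digits.
2. Letter range: [a-z] matches any single lowercase letter (in ASCII, a is code 97 and z is code 122). Likewise, [A-Z] matches any single uppercase letter.
3. Negated range: [^a-zA-Z0-9] matches any single character that is neither a letter nor a digit.
4. Explicit set: a class can also simply list arbitrary characters; for example, [!@#$%^&*()] matches any one of the listed characters.

Ranges can also be combined with other metacharacters and escape sequences to express more complex matching requirements.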
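The four forms listed above can be tried directly. The following is a small sketch using Python's re module; the sample string is invented for illustration.

```python
import re

# Hypothetical sample string for this sketch.
sample = "Order #A-42: 3 items, total $19.99!"

digits = re.findall(r"[0-9]+", sample)          # runs of digits
lower = re.findall(r"[a-z]+", sample)           # runs of lowercase letters
others = re.findall(r"[^a-zA-Z0-9]", sample)    # single non-alphanumeric characters
symbols = re.findall(r"[!@#$%^&*()]", sample)   # characters from an explicit set

print(digits)
print(lower)
print(symbols)
```

Note that ^ only negates a class when it is the first character after the opening bracket; in the explicit set it is an ordinary member.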
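Some common examples that use ranges:

1. Check a mobile phone number format: ^\d{11}$
   This uses the digit shorthand \d together with the quantifier {11}, so it matches a string of exactly 11 digits.
2. Check an email address format: ^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$
   This uses letter ranges, digit ranges, and the quantifier +, and matches an address made of a local part, an @ sign, and a domain ending in at least two letters.
3. Check password strength: ^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$
   This uses letter ranges, digit ranges, the quantifier {8,}, and positive lookahead (?=...). It matches a password of at least 8 letters and digits that contains at least one lowercase letter, one uppercase letter, and one digit.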
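As a quick check of the three patterns above, they can be compiled and applied to a few inputs. This is a sketch only; the helper name is_valid and the sample values are hypothetical, and real-world validation of phone numbers or email addresses usually needs more than one regular expression.

```python
import re

# The three patterns quoted in the text above.
PHONE = re.compile(r"^\d{11}$")
EMAIL = re.compile(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")
PASSWORD = re.compile(r"^(?=.*[a-z])(?=.*[A-Z])(?=.*\d)[a-zA-Z\d]{8,}$")


def is_valid(pattern, value):
    """Hypothetical helper: True if the compiled pattern matches the value."""
    return pattern.match(value) is not None


print(is_valid(PHONE, "13800138000"))
print(is_valid(EMAIL, "user@example.com"))
print(is_valid(PASSWORD, "Abcdefg1"))
```

Each lookahead in the password pattern checks one requirement without consuming characters, which is why three of them can be stacked at the start of the pattern.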
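These ranges can be extended and modified to meet more specific matching requirements. When writing regular expressions, it helps to consult the relevant documentation, tutorials, and reference material, such as the corresponding chapters of Mastering Regular Expressions, which covers regular expression syntax and usage in detail.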
Reading Notes on Mastering Regular Expressions
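A regular expression is a special text pattern used to describe, match, and process strings. It is a powerful, flexible, and general-purpose tool, widely used across programming languages and technologies for string processing, text search, replacement, and matching. Regular expressions are a basic and important concept in computer science and software development, and knowing their syntax and usage can noticeably improve a programmer's productivity and code quality.

1. Basic concepts

A regular expression is made up of ordinary characters (letters, digits, symbols, and so on) and special characters (metacharacters), and describes a pattern for strings to match. Each element has a specific meaning: for example, \d matches one digit, \w matches one word character, and . matches any character. By combining ordinary characters and metacharacters, complex matching rules can be built to meet specific text-processing needs.

2. Basic syntax and common metacharacters

The syntax and metacharacters differ slightly between programming languages and tools, but the basic concepts and the common metacharacters are shared. Common metacharacters include:

- ^: matches the start of the string
- $: matches the end of the string
- *: matches the preceding element zero or more times
- +: matches the preceding element one or more times
- ?: matches the preceding element zero or one time
- \d: matches one digit
- \w: matches one word character
- .: matches any character (by default, any character except a newline)
- [...]: matches any one of the characters inside the brackets
- (pattern): matches pattern and captures the matched text
- |: matches either the expression before the | or the one after it

3. Where regular expressions are used

Regular expressions are widely used for text processing and string matching. Common uses include:

- Data validation: checking the format of user input such as email addresses, phone numbers, and ID numbers.
- Text search: searching and replacing in a text editor or IDE to quickly locate and process text that follows a particular pattern.
- Data extraction: pulling the needed information out of complex text, such as log files or HTML, during data processing and analysis.
- URL routing: matching URL routing rules in web development to allow flexible URL matching and handling.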
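To make the data-extraction and search-and-replace uses above concrete, here is a minimal sketch in Python. The log text and its format are invented for illustration.

```python
import re

# Hypothetical log excerpt for this sketch.
log = """2024-01-05 ERROR disk full
2024-01-05 INFO backup started
2024-01-06 ERROR timeout"""

# With re.MULTILINE, ^ and $ anchor at the start and end of every line.
# The two groups capture the date and the error message.
error_line = re.compile(r"^(\d+-\d+-\d+) ERROR (.+)$", re.MULTILINE)
errors = error_line.findall(log)

# Search and replace: mask every digit in a string.
masked = re.sub(r"\d", "#", "id 42")

print(errors)
print(masked)
```

The same pattern combines several of the metacharacters from the list above: ^ and $ as anchors, \d and + for runs of digits, . for arbitrary text, and ( ) to capture the parts worth keeping.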
Mastering Regular Expressions (mini version)
Jeffrey E. F. Friedl
1st Edition, January 1997
ISBN 1-56592-257-3, 366 pages

Regular expressions, a powerful tool for manipulating text and data, are found in scripting languages, editors, programming environments, and specialized tools. In this book, author Jeffrey Friedl leads you through the steps of crafting a regular expression that gets the job done. He examines a variety of tools and uses them in an extensive array of examples, with a major focus on Perl.

Preface
    Why I Wrote This Book
    Intended Audience
1 Introduction to regular expressions
    1.1 What are regular expressions used for
    1.2 Solving real problems
    1.3 Regular expressions as a language
    1.4 The filename analogy
    1.5 The language analogy
    1.6 The regular expression frame of mind
    1.7 Searching text files
    1.8 Grep, egrep, fgrep, perl, say what?
2 Character classes
    2.1 Matching list of characters
    2.2 Negated character classes
    2.3 Character class and special characters
    2.4 POSIX locales and character classes
3 Regular expressions syntax
    3.1 Marking start and end
    3.2 Matching any character
    3.3 Alternation and grouping
    3.4 Alternation and anchors
    3.5 Word boundaries
    3.6 Quantifiers (basic, greedy)
    3.7 Quantifiers (basic, additional)
    3.8 Quantifiers (extended, non-greedy)
    3.9 Ignoring case
    3.10 Parentheses and back references
    3.11 Problems with parentheses
    3.12 The escape character - backslash
    3.13 Backslash character notation in regular expressions
    3.14 Line endings \r \n and operating systems
    3.15 Perl shorthand regular expressions
4 Perl zero width assertions
    4.1 Beginning of line (^) and (\A)
    4.2 End of line ($) and (\Z) and (\z)
    4.3 Word (\b) and non-word (\B) boundaries
    4.4 Match continue from last position (\G)
5 Perl Regular expression modifiers
    5.1 Perl match operator
    5.2 Perl substitute command
    5.3 Modifiers in matches
    5.4 Do not reset position or continue (c)
    5.5 Global matching (g)
    5.6 Ignore case (i)
    5.7 Lock regular expression (o)
    5.8 Span multiple lines (m)
    5.9 Single line matches and dot (s)
    5.10 Extended writing mode (x)
    5.11 Evaluate perl code (e)
6 Perl Extended regular expression patterns
    6.1 Comment (?#text)
    6.2 Modifiers (?imsx-imsx)
    6.3 Non-capturing parenthesis (?:pattern)
    6.4 Zero-width positive lookahead (?=pattern)
    6.5 Zero-width negative lookahead (?!pattern)
    6.6 Zero-width positive lookbehind (?<=pattern)
    6.7 Zero-width negative lookbehind (?<!pattern)
    6.8 Zero-width Perl eval assertion (?{ code })
    6.9 Postponed expression (??{ code })
    6.10 Independent subexpression (?>pattern)
    6.11 Conditional pattern (?(condition)yes-pattern|no-pattern)
7 Regular expression discussion
    7.1 Matching numeric ranges
    7.2 Pay attention to the use of .*
    7.3 Variable names
    7.4 A String within double quotes
    7.5 Dollar amount with optional cents
    7.6 Matching range of numbers
    7.7 Matching temperature values
    7.8 Matching whitespace
    7.9 Matching text between HTML tags
    7.10 Matching something inside parenthesis
    7.11 Reducing number of decimals to three (substituting)
8 Different regular expression engines
    8.1 Regexp engine types
    8.2 NFA engine relies on regexp (Perl, Emacs)
    8.3 DFA engine reads text
    8.4 Crafting regular expressions
    8.5 Differences in capabilities
9 Appendix A - regular expressions
    9.1 Perl regular expression syntax
    9.2 Regular expression engines
    9.3 Regular expression rules
    9.4 How to write good regular expressions
    9.5 Understanding negative lookahead
10 Appendix B - Perl language
    10.1 Perl manual pages
    10.2 Useful Perl command line switches
    10.3 Perl environment variables

Preface

This book is about a powerful tool called "regular expressions."

Here, you will learn how to use regular expressions to solve problems and get the most out of tools that provide them. Not only that, but much more: this book is about mastering regular expressions.

If you use a computer, you can benefit from regular expressions all the time (even if you don't realize it). 
When accessing World Wide Web search engines, with your editor, word processor, configuration scripts, and system tools, regular expressions are often provided as "power user" options. Languages such as Awk, Elisp, Expect, Perl, Python, and Tcl have regular-expression support built in (regular expressions are the very heart of many programs written in these languages), and regular-expression libraries are available for most other languages. For example, quite soon after Java became available, a regular-expression library was built and made freely available on the Web. Regular expressions are found in editors and programming environments such as vi, Delphi, Emacs, Brief, Visual C++, Nisus Writer, and many, many more. Regular expressions are very popular.

There's a good reason that regular expressions are found in so many diverse applications: they are extremely powerful. At a low level, a regular expression describes a chunk of text. You might use it to verify a user's input, or perhaps to sift through large amounts of data. On a higher level, regular expressions allow you to master your data. Control it. Put it to work for you. To master regular expressions is to master your data.

Why I Wrote This Book

You might think that with their wide availability, general popularity, and unparalleled power, regular expressions would be employed to their fullest, wherever found. You might also think that they would be well documented, with introductory tutorials for the novice just starting out, and advanced manuals for the expert desiring that little extra edge.

Sadly, that hasn't been the case. Regular-expression documentation is certainly plentiful, and has been available for a long time. (I read my first regular-expression-related manual back in 1981.) The problem, it seems, is that the documentation has traditionally centered on the "low-level view" that I mentioned a moment ago. 
You can talk all you want about how paints adhere to canvas, and the science of how colors blend, but this won't make you a great painter. With painting, as with any art, you must touch on the human aspect to really make a statement. Regular expressions, composed of a mixture of symbols and text, might seem to be a cold, scientific enterprise, but I firmly believe they are very much creatures of the right half of the brain. They can be an outlet for creativity, for cunningly brilliant programming, and for the elegant solution.

I'm not talented at anything that most people would call art. I go to karaoke bars in Kyoto a lot, but I make up for the lack of talent simply by being loud. I do, however, feel very artistic when I can devise an elegant solution to a tough problem. In much of my work, regular expressions are often instrumental in developing those elegant solutions. Because it's one of the few outlets for the artist in me, I have developed somewhat of a passion for regular expressions. It is my goal in writing this book to share some of that passion.

Intended Audience

This book will interest anyone who has an opportunity to use regular expressions. In particular, if you don't yet understand the power that regular expressions can provide, you should benefit greatly as a whole new world is opened up to you. Many of the popular cross-platform utilities and languages that are featured in this book are freely available for MacOS, DOS/Windows, Unix, VMS, and more. Appendix A has some pointers on how to obtain many of them.

Anyone who uses GNU Emacs or vi, or programs in Perl, Tcl, Python, or Awk, should find a gold mine of detail, hints, tips, and understanding that can be put to immediate use. The detail and thoroughness is simply not found anywhere else. Regular expressions are an idea, one that is implemented in various ways by various utilities (many, many more than are specifically presented in this book). 
If you master the general concept of regular expressions, it's a short step to mastering a particular implementation. This book concentrates on that idea, so most of the knowledge presented here transcends the utilities used in the examples.

1.0 Introduction to regular expressions

1.1 What are regular expressions used for

Here comes the scenario: your boss in the documentation department wants a tool to check for doubled words (e.g. "this this"), a common problem with documents subject to heavy editing. Your job is to create a solution that will:

• Accept any number of files to check, and report each line of each file that has doubled words.
• Work across lines, finding doubled words even when they sit on separate lines.
• Find doubled words in spite of capitalization differences ("The", "the"), as well as allowing whitespace between the words.
• Find doubled words that might even be separated by HTML tags: "it's very <I>very</I> important"

That is not an easy task! If you run such a tool on existing documents, you may be surprised to find mistakes of this kind in various sources. There are many programming languages one could use to solve the problem, but one with regular expression support can make the job substantially easier.

Regular expressions are the key to powerful, flexible, and efficient text processing. Regexps themselves, with a general pattern notation, almost like a mini programming language, allow you to describe and parse text. With additional support provided by the particular tool being used, regular expressions can add, remove, isolate, and generally fold, spindle, and mutilate all kinds of text and data. It might be as simple as a text editor's search command or as powerful as a full text processing language. 
You have to start thinking in terms of regexps, and not in the way you are used to from your previous programming languages, because only then can you take advantage of their full power.

The host language (Perl, Python, Emacs Lisp) provides the peripheral processing support, but the real power comes from regular expressions. Using regexps correctly makes it possible to identify the text you want and bypass the portions that you are not interested in.

1.2 Solving real problems

Checking text in files

As a simple example, suppose you need to check a slew of files (70-150) to confirm that each file contains SetSize exactly as often as it contains ResetSize. To complicate matters, you should disregard capitalization and accept, for example, SETSIZE. The total count of lines in those files could easily reach 30000 or more, and checking them by hand would give you a headache. Even a normal "find this word" command in a text processor would be truly arduous, what with all the files and all the possible capitalizations. Regexps come to the rescue. By typing a single short command you can do the work in seconds and confirm what you want to know.

    % perl -0ne "print qq($ARGV\n) if s/ResetSize//ig != s/setSize//ig" *

Summary of Email mailbox

If you wanted to create a summary of the messages in your mailbox, it would be tedious to read all 1000 of your mails and copy the important lines (like From: and Subject:) into a separate file by hand. What if you were behind dial-up? The on-line time spent making such a summary would quickly eat into your pocket if you had to do it multiple times. In addition, you couldn't do it for some other person, because you would see the contents of his mailbox. Regexps come to the rescue again. A very simple command could display a summary of those lines immediately.

    % perl -ne "print if /^(From|To):/" ~/Mail/*

What if someone asked about that summary? 
It would be unnecessary to send the 5000 lines of results when you could send that little one-liner to your friend and ask him to run it on his own mailbox.

1.3 Regular expressions as a language

Unless you have had some experience with regular expressions, you won't understand the above commands. There really is no special magic here, just a set of rules that must be digested. Once you learn how to hide a coin behind your hand, you know there is not much magic in it, just a lot of practice and learning new skills. Like a foreign language, it will stop sounding like "gibberish" after a while.

1.4 The filename analogy

If you have experience only with the Win32/Windows environment, you already have a grasp that the following refers to multiple files:

    *.txt

With filename patterns like this (called file globs), there are a few characters that have a special meaning.

    *   => means: "MATCH ANYTHING"
    ?   => means: "MATCH ONE CHARACTER"

The complete example above will be parsed as

    *.txt
    |||||
    ||+++--- Match three characters in order "t" "x" "t"
    |+------ Match a "dot"
    +------- Match anything [A special character]

And the whole pattern is thus read as "Match files that start with anything and end with .txt"

Most systems provide a few additional special characters, but in general these filename patterns are limited in expressive power. This is not much of a shortcoming because the scope of the problem (to provide convenient ways to specify groups of files) is limited to filenames.

On the other hand, dealing with general text is a much larger problem. Prose and poetry, program listings, reports, lyrics, HTML, articles, code tables, word lists ... you name it. Over the years a generalized pattern language has developed which is powerful and expressive for a wide variety of uses. Each program implements and uses it differently, but in general this powerful pattern language and the patterns themselves are called Regular Expressions.

1.5 The language analogy

Full regular expressions are composed of two types of characters. 
The special characters (like "*" in filename patterns) are called meta-characters, while everything else is called literal or normal text characters. What sets regular expressions apart from filename patterns is the scope of power their meta-characters provide. Filename patterns provide limited patterns, but the regular expression "language" provides rich and expressive power to advanced users.

1.6 The regular expression frame of mind

Complete regular expressions are built up from small building block units. Each building block is in itself quite simple, but since they can be combined in an infinite number of ways, knowing how to combine them to achieve a particular goal takes some experience. While some regular expressions may seem silly, they really do represent the kind of tasks that are done in real life - you just might not realize it yet.

Just as there is a difference between playing a musical piece well and making music, there is a difference between understanding regular expressions and really understanding them.

1.7 Searching text files

Finding text is the simplest use of regular expressions - many text editors and word processors allow you to search a document using some kind of pattern matching. Let's return to the original example of finding some relevant lines from a mailbox and study it in detail:

                Perl command part
                |        Regular expression part
                +------+ +-----------+
    % perl -ne "print if /^(From|To):/" file.txt
    | |    |   |                        |
    | |    |   |                        Read from file
    | |    |   Start of the code (Win32/Unix). Unix also accepts single quotes (')
    | |    |
    | |    Give some options
    | |      -n  Do not print unless requested
    | |      -e  Read expression or code immediately
    | |
    | Call program "perl"
    |
    Command shell's prompt. In Unix % or $ and in Win32 typically >

An even simpler example would be searching for every line containing a word like "cat":

    % perl -ne "print if /cat/" Mail/*.*

But things are not that simple, because how do you know which word is plain "cat"? You must consider how "catalog", "caterpillar", and "vacation" differ semantically from the animal "cat". The results do not show what was actually matched and made the line selected; the lines are just printed. The key point is that regular expression searching is not done on a "word" basis, but only on a character basis, without any knowledge of e.g. English language syntax.

1.8 Grep, egrep, fgrep, perl, say what?

There is a family of products that started the era of regular expressions in Unix tools, known as grep(1), egrep(1), fgrep(1), sed(1) and awk(1). The first of all was grep, soon followed by extended grep, egrep(1), which offered more patterns in its regular expression syntax. The final evolution is perl(1), which enhanced regular expressions far beyond what could have been imagined. Nowadays, whenever you talk about regular expressions, the foundation of new inventions lies in the Perl language. The Unix man page about regular expressions is regexp(5).

2.0 Character classes

You must THINK of the character class notation as its own regular expression sub-language. It has its OWN rules that are not the same as those outside of character classes.

2.1 Matching list of characters

What if you want to search for all mentions of the color "grey", including those spelled "gray", which differs by one character? You can define a list of characters to match: a character class. This regexp reads: "Find character g followed by r, then try the next character against e OR a, and finally character y is required."

    /gr[ea]y/

As another example, suppose you want to allow capitalization of a word's first letter. Remember that this still matches lines that contain smith or Smith embedded in another word, as in blacksmith. 
This issue is a common source of problems among new users.

    /[Ss]mith/

You can list as many characters in the class as you like. Notice that you can list the items in any order:

    /[123456]/
    /[654321]/

This might be a good set of choices for finding HTML headings in a page: <H1> <H2> .. <H6> (That is the maximum according to the HTML 4.x specification. Refer to /)

    /<H[123456]>/

There are a few rules concerning the character class: not all characters inside it are pure literals, taken "as is". A dash (-) indicates a range of characters, and here is an identical example:

    /<H[1-6]>/

One thing to memorize is that regular expressions are case sensitive. Matching "a" is different from matching "A", as you will notice if you construct a set of letters for regular 7-bit English text. (Different countries have different sets of characters, but that is a whole separate issue.)

    /[a-z]/         Ehm...
    /[a-zA-Z]/      Maybe this is what you wanted?

Remember that the dash (-) is special only inside a character class; here it is just a regular dash:

    /a-z/           Match character "a", character "-", character "z"

Multiple ranges can be used inside a class in any order, but each range must follow the ASCII-table sequence and not run backwards:

    /[a-zA-Z0-9]/       ok
    /[A-Za-zA-Z0-9]/    Repetitive, but no problem
    /[0-9A-Za-z]/       ok
    /[9-0]/             Oops, you can't say it "backwards"

Exclamation marks and other special characters are just characters to match:

    /[!?:,.=]/

Or pick your own personal set of characters. This does not match the word "help"; it matches any one of the characters h, e, l, or p:

    /[help]/

2.2 Negated character classes

It is easy to write which characters you want to include, but what if you would like to match everything except a few characters? It would be impractical to list all the possible characters and then leave out only some of them:

    /[ZXCVBNMASDFGHJKLQWERTYUIO ...]/

A special character (^) placed first inside a character class says "do not include these characters". 
(This same character has a different meaning outside of the class, where it means "beginning of line".)

      The NOT operator
      |
    /[^0-9]/        Match everything but digits.

NOTE: The end-of-line marker is different on various OS platforms. The above regular expression will match a line containing only plain digits, because there is an embedded end-of-line marker at the end: in Unix "1234567\n", in Win32 "1234567\r\n" and on Mac "1234567\r".

Why would the following regular expression list the items below?

    % perl -ne "print if /q[^u\r\n]/" file.txt

    Iraqi
    Iraqian
    miqra
    qasida
    qintar
    qoph
    zaqqum

Why didn't it list words like

    Quantas
    Iraq

2.3 Character class and special characters

The brackets ([])

As we have learned earlier, some of the characters in a character class are special. These include the range (-) and negation (^) characters, but you must remember that the characters that delimit the class itself are also special: ] and [. So, what happens if you really need to match any of these characters? Suppose you have the text:

    See [ref] on page 55.

And you need to find the brackets [] themselves. You must write it like this, although it looks funny. It works because an "empty set" is not a valid character class, so there are not really two "empty character class sets" here:

    start of class
    |  end of class
    |  |
    /[][]/          => also /[]abcdef0-9]/
      ||
      |character "["
      character "]"

Rule: ] must be at the beginning of the class, and [ can be anywhere in the class.

The dash (-)

If the dash is used to delimit a range in a character class, we have the problem of what to do with it if we want to match person names like "Leary-Johnson". 
The solution can be found if we remember that a dash needs a FROM-TO; if we omit either one, and write FROM- or -TO, then the special meaning is canceled.

    /[-a-zA-Z]/  OR  /[a-zA-Z-]/

Rule: the dash (-) is taken literally when it is put either at the beginning or at the end of the character class.

The caret (^)

We still have one character that has a special meaning: the negation operator, which excludes characters from the set. We can resolve the conflict and have "^" taken literally, as a plain character, by moving it out of its special position at the beginning of the class.

    /[^abc]/        Means: all except the characters "a", "b" and "c"
    /[abc^]/        Caret has no special meaning any more. Matches
                    characters "a" "b" "c" and "^"
    /[a^bc]/        Works too. "^" is taken literally.

Rule: the caret (^) loses its special meaning when it is not the first character in the class.

How to put all together

Huh, do we dare to combine all these exceptions in one regular expression that would say, "I want these characters: ^, - , ] and ["? It might be an impossible or at least time-consuming task if you didn't know the rules for these characters. With trial and error you could eventually come up with the right solution, but you would never fully understand why it works. Here is the answer. Can you think of more possible choices?

    /[][^-]/

And now the final master thesis question: how do you reverse the question, "I want to match everything except the characters ^, - , ] and ["?

2.4 POSIX locales and character classes

POSIX, short for Portable Operating System Interface, is a standard ensuring portability across operating systems. Within this ambitious standard are specifications for regular expressions, and many of the traditional Unix tools use them.

One feature of the POSIX standard is the notion of a locale: settings which describe language and cultural conventions such as the display of dates, times and monetary values, the interpretation of characters in the active encoding, and so on. Locales aim at allowing programs to be internationalized. 
It is not a regexp-specific concept, although it can affect regular expression use. For example, when working with the Latin-1 (ISO-8859-1) encoding, the character "a" has many different variants in different languages (think of adding an accent on top of "a"). Perl defines \w as a word character, equivalent to the regexp [a-zA-Z0-9_], but this is not the whole story, since Perl respects the use locale directive in programs and thus allows the A-Z character range to be enlarged.

POSIX collating sequence

A locale can define collating sequences to describe how to treat certain characters, or sets of characters, for sorting. For example, Spanish ll as in "tortilla" traditionally sorts as if it were one logical character between l and m. These rules might be manifested in collating sequences named span-ll, and eszet for German ss. As with span-ll, a collating sequence can define multi-character sequences that should be taken as a single character. This means that the dot (.) in the regular expression /torti.a/ matches "tortilla".

POSIX character class

A POSIX character class is one of several special meta sequences for use within a POSIX bracket expression. An example is [:lower:], which represents any lowercase letter within the current locale. For normal English that would be [a-z]. The exact list of POSIX character classes is locale dependent, but the following are usually supported (appeared 2000-06 in perl 5.6). 
See more in the [perlre] manual page.

    [:class:]    GENERIC POSIX SYNTAX, replace "class" with names below
    [:^class:]   PERL EXTENSION, negated class

    [:alpha:]    alphabetic characters
    [:alnum:]    alphabetic characters and numeric characters
    [:ascii:]    any ASCII character
    [:cntrl:]    control characters
    [:digit:]    \d digits
    [:graph:]    non-blank (no spaces or control characters)
    [:lower:]    lowercase alphabetics
    [:print:]    like "graph" but includes space
    [:punct:]    punctuation characters
    [:space:]    \s all whitespace characters
    [:upper:]    uppercase alphabetics
    [:word:]     \w
    [:xdigit:]   any hexadecimal digit, [0-9a-fA-F]

Here is an example of the basic regular expression syntax and the roughly equivalent POSIX syntax. Both match a word that starts with an uppercase letter.

    /[A-Z][a-z]+/
    /[[:upper:]][[:lower:]]+/

POSIX character equivalents

Some locales define character equivalents to indicate that certain characters should be considered identical for sorting. The equivalent characters are listed with the [=...=] notation. For example, to match Scandinavian "a"-like characters, you could use [=a=]. Perl 5.6 (2000-06) recognizes this syntax but does not support it. According to the [perlre] manual page: "The POSIX character classes [.cc.] and [=cc=] are recognized but not supported and trying to use them will cause an error: Character class syntax [= =] is reserved for future extensions"

3.0 Regular expressions syntax

3.1 Marking start and end

A good start to regular expressions is to discuss how they define beginning-of-line (^) and end-of-line ($). Both are special meta-characters that mark a position. As we have seen, "cat" will be matched anywhere in the line, but we may want to anchor the match to the start of the line. Get into the habit of interpreting regular expressions in a rather literal way; don't loosen up your mind or you will read the regular expression wrongly. [IMPORTANT] The ^ and $ are particular in that they match a position. 
They do not match any actual characters themselves.

    /^cat/

    WRONG: Matches a line with "cat" at the beginning.
    RIGHT: Matches at the beginning of the line, FOLLOWED by character "c" and character "a" and character "t".

How would you read the following expressions?

    /^cat$/
    /^$/
    /$^/
    /^/
    /$/
    //
    /cat^/
    /$cat/          # This is not "end-of-line" + "cat", but a variable

3.2 Matching any character

The meta-character dot (.) is shorthand for a pattern that matches any single character except newline. For example, if you want to search for dates written in year-month-day order, like 2000-06-01, 2000/06/01 or 2000.06.01, you could construct the regular expression using character classes or just allow any character in between:

     Note, the "/" must be escaped with \/ because
     it would otherwise terminate the Perl regexp / ..... /
            |
    /2000[.\/-]06[.\/-]01/      This is more accurate
    /2000.06.01/                The "." accepts anything in its place

Notice the different semantics again in the above regexps: the dot (.) is not a meta-character inside the character class, as in the first example. It only has its special meaning when it is used outside of a class, as in the second example. [IMPORTANT] Consider using the dot (.) only if you know that the data is in a consistent format, because it may cause trouble and match lines that you didn't want to, like lottery numbers. The first regexp is safer to use than the second, which will match:

    Lottery this week: 12 2000106 01 20
                          ====.==.==

3.3 Alternation and grouping

When you want to choose from several possibilities, you mean the word OR. The regular expression atom for it is |, as in programming languages. When used in regexps, the parts of the regular expression separated by | are called alternations.

    Try "Bob" first. If not found, then try "Joe" ...
     |
    /Bob|Joe|Mike|Helen/
     ===
     This part is tried completely before moving to the next
     alternation after "|".

The alternations are tried in order from left to right (but refer to DFA and NFA engines).

Looking back to the color matching with gr[ea]y, it could have been written using alternation:

    /grey|gray/     Both of the regexps are effectively
    /gr[ea]y/       ..the same, but gr[ea]y is faster.
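The equivalence of the alternation and the character class, and the left-to-right order in which alternatives are tried, can be checked with a short sketch. It uses Python's re module rather than Perl, on the assumption that both are backtracking engines that try alternatives in order; the word list and sentence are made up.

```python
import re

alternation = re.compile(r"grey|gray")
char_class = re.compile(r"gr[ea]y")

# Hypothetical word list for this sketch.
words = ["grey", "gray", "groy", "greyhound"]
alt_hits = [w for w in words if alternation.search(w)]
cls_hits = [w for w in words if char_class.search(w)]

# The earliest position in the text wins, regardless of alternative order.
first_name = re.search(r"Bob|Joe|Mike|Helen", "Helen met Joe and Bob").group()

# At the same position, the leftmost alternative that succeeds is used.
ordered = re.search(r"Jo|Joe", "Joe").group()

print(alt_hits, cls_hits)
print(first_name, ordered)
```

The last line shows why the order of alternatives matters in this kind of engine: "Jo" is accepted before "Joe" is ever tried.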
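My recent work involved parsing data, which naturally meant brushing up on regular expressions. The official JDK documentation recommends Mastering Regular Expressions, 2nd Edition, written by Jeffrey E. F. Friedl and published in its second edition by O'Reilly in 2002. I have always had a soft spot for O'Reilly books: when I first strayed into Java years ago, I did not read Thinking in Java or the like, but an O'Reilly reprint of Java in a Nutshell, which left a lasting impression.

The "ancestors" of regular expressions can be traced back to early research on how the human nervous system works. Warren McCulloch and Walter Pitts, two neurophysiologists, developed a mathematical way of describing these neural networks. In 1956, the mathematician Stephen Kleene, building on the earlier work of McCulloch and Pitts, published a paper titled "Representation of Events in Nerve Nets" that introduced the concept of regular expressions. Regular expressions were the expressions used to describe what he called "the algebra of regular sets", hence the term "regular expression". This work was later applied to early research on computational search algorithms by Ken Thompson, a principal inventor of Unix. The first practical application of regular expressions was the qed editor, which Thompson later followed with the Unix editor ed.

Today regular expressions are widely used in software: in operating systems such as *nix (Linux, Unix, and so on) and HP's systems; in development environments such as PHP, Perl, Python, C#, and Java; and in many applications, for example web search engines and full-text search in databases.

These notes are a summary of my own study; the examples come either from the book or from my own experiments. Enough preamble; on to the subject.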
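1. Introduction to regular expressions

1.1 Start and end of line

^ (begin line) matches the start of a line; for example, ^cat matches lines that begin with cat.
$ (end line) matches the end of a line; for example, cat$ matches lines that end with cat, and ^cat$ matches only lines that consist of exactly cat.

1.2 Matching one of a given set of characters

[...] means "in". Write the characters you want to match inside the brackets: sep[ea]r[ea]te matches seperate, separete, separate, and seperete; <H[123456]> matches <H1>, <H2>, <H3>, etc. [a-z] stands for any one character from a to z, and [0-9] and [A-Z] stand for any one digit from 0 to 9 and any one uppercase letter from A to Z. The "-" denotes a continuous range from the start character to the end character, so [0123456789abcdefABCDEF] can also be written [0-9a-fA-F]. Because such sets are used so often, most regex flavors predefine similar shorthands for them.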
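Sections 1.1 and 1.2 can be tried out with a few lines of Python. This is a minimal sketch; the sample strings are invented for illustration.

```python
import re

# Hypothetical sample lines for this sketch.
lines = ["cat", "catalog", "tomcat", "a cat sat"]

starts_with_cat = [s for s in lines if re.search(r"^cat", s)]
ends_with_cat = [s for s in lines if re.search(r"cat$", s)]
exactly_cat = [s for s in lines if re.search(r"^cat$", s)]

# A character class matches exactly one character from the listed set.
spellings = re.findall(r"sep[ea]r[ea]te", "seperate separete separate seperete")

# Ranges: runs of hexadecimal digits.
hex_runs = re.findall(r"[0-9a-fA-F]+", "ff 7G 1a2B")

print(starts_with_cat, ends_with_cat, exactly_cat)
print(spellings)
print(hex_runs)
```

Since each class has two members, sep[ea]r[ea]te accepts all four combinations, including spellings that are not real words.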
1.3、匹配非给定的字符(非...)[^]匹配,表示not。
^和行开头的标记完全一样,但写的位置不一样,则表述的意思可能完全相反,用^表示否定的意思,更多是写在[]里面,如:q[^u]匹配q后面紧跟非u的字符,如Iraqi,qasida,zaqqum,Iraq;没错,"Iraq"这个单词也会被匹配,尽管q后面什么也没有,也可能有个空格、或回车符等。
否定字符的意思(翻译出来绕口):means "match a character that's not listed" and not "don't match what is listed."1.4、匹配任何字符.匹配,表示any。
任何字符,如07.04匹配:07_04,07-04,07 04,07.04 etc;如想要精确匹配07/04,07-04,or 07.04;需要写07[-./]04;没错当.在[]里面包含的时候,仅仅表示“.”字符而已,如果不在[]里面,需要转义\\. 如匹配形如x.y的小数:是[0-9]\\.[0-9],而非[0-9].[0-9]1.5、匹配几个给定的字符序列中的一个|匹配,表示or。
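For example, gr[ea]y can also be written grey|gray or gr(a|e)y; all three mean exactly the same thing, matching only the two words gray and grey.
Another example: suppose we are reading an e-mail such as:
From: elvis@ (The King)
Subject: be seein' ya around
Date: Thu, 31 Oct 96 11:04:13
Hi, .........
yours smith
We only want to read the From:, Subject:, and Date: lines (sender, subject, and date sent), which this expression does: "^(From|Subject|Date):".
^ anchors the match at the start of the line; then From, Subject, or Date must follow, immediately followed by ":". So when matching one of several given character sequences, we need () to group them. Parentheses have other uses too, which are covered later.
| has very low precedence: each alternative extends as far as it can, up to the enclosing parentheses or the ends of the expression. So writing ^From|Subject|Date: is wrong here: it means "^From, or Subject, or Date:", that is, From only at the start of a line, but Subject or Date: anywhere in the line.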
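The difference between the grouped and the ungrouped expression can be seen in a small sketch. The class HeaderFilterDemo and its methods are made up for illustration. The sample message reuses the header lines from the notes, and its body line is invented so that it happens to contain the word Subject.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical demo class for these notes.
class HeaderFilterDemo {
    private static final Pattern GROUPED = Pattern.compile("^(From|Subject|Date):");
    private static final Pattern UNGROUPED = Pattern.compile("^From|Subject|Date:");

    // Sample message; the body line is invented for illustration.
    static final String[] SAMPLE = {
        "From: elvis@ (The King)",
        "Subject: be seein' ya around",
        "Date: Thu, 31 Oct 96 11:04:13",
        "Hi, the Subject line above is short.",
        "yours smith",
    };

    // Keep only the lines in which the pattern matches somewhere.
    static List<String> select(Pattern p, String[] lines) {
        List<String> kept = new ArrayList<>();
        for (String line : lines) {
            if (p.matcher(line).find()) {
                kept.add(line);
            }
        }
        return kept;
    }

    static List<String> headers(String[] lines) { return select(GROUPED, lines); }

    static List<String> loose(String[] lines) { return select(UNGROUPED, lines); }

    public static void main(String[] args) {
        System.out.println("grouped:   " + headers(SAMPLE));
        System.out.println("ungrouped: " + loose(SAMPLE));
    }
}
```

With the grouped pattern only the three header lines are kept; the ungrouped one also keeps the body line, because the bare alternative Subject is allowed to match anywhere in a line.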
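1.6 Word boundaries and escape characters in Java
\< and \> match the start and the end of a word respectively; for example, \<cat\> finds the word cat.
\<cat or cat\> finds words that begin or end with cat.
Note: not every regex flavor supports these word-boundary metacharacters; Java, for one, does not.
Java defines \b to match a word boundary and \B to match a non-word boundary, but has no separate metacharacters for the start or the end of a word.
By the rules of Java syntax, backslashes in string literals in Java source code are interpreted as Unicode escapes or other character escapes; the predefined ones include: \b \t \n \f \r \" \' \\.
They stand for backspace, tab, newline, form feed, carriage return, double quote, single quote, and backslash.
So in a Java string literal, apart from these (and octal escapes such as \1), any other sequence written with a single backslash, such as "\d" or "\(", is a compile-time error.
It is therefore necessary to double the backslashes in string literals that represent regular expressions, to protect them from interpretation by the Java compiler. Separately, within the regex itself, a backslash before an alphabetic character that does not denote an escaped construct is also an error; such sequences are reserved for future extensions of the regular-expression language.
For example, the string literal "\b" matches a single backspace character when interpreted as a regular expression, while "\\b" matches a word boundary.
The string literal "\(hello\)" is illegal and causes a compile-time error; to match the string (hello), the string literal "\\(hello\\)" must be used.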
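The escaping rules can be checked with a short sketch. The class name JavaEscapeDemo and the helper found are invented for these notes. It compares "\\b" with "\b", matches a literal (hello), and contrasts an escaped dot with a bare dot.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for these notes.
class JavaEscapeDemo {
    // True if the regex matches somewhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // "\\b" reaches the regex engine as \b, a word boundary.
        System.out.println(found("\\bcat\\b", "a cat sat"));
        System.out.println(found("\\bcat\\b", "concatenate"));
        // "\b" is turned into a backspace character by the Java compiler,
        // so the regex is a single literal backspace.
        System.out.println(found("\b", "a\bb"));
        System.out.println(found("\b", "a cat sat"));
        // Literal parentheses need \( and \), hence "\\(" and "\\)" in source.
        System.out.println(found("\\(hello\\)", "say (hello)"));
        // Escaped dot versus bare dot.
        System.out.println(found("[0-9]\\.[0-9]", "3x5"));
        System.out.println(found("[0-9].[0-9]", "3x5"));
    }
}
```

A string such as "\(hello\)" cannot even be included in the sketch, since the file would no longer compile.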
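Summary:
Character   Name                      Meaning
.           dot                       any character
[...]       character class           one of the given characters
[^...]      negated character class   a character not among those given
^           caret                     start of line
$           dollar                    end of line
\<          backslash less-than       start-of-word boundary (*not supported by all flavors)
\>          backslash greater-than    end-of-word boundary (*not supported by all flavors)
|           or; bar                   or: must match one of the given character sequences
(...)       parentheses               delimits the alternatives of |, groups, or acts as a capturing group
1.7 Quantifiers (number of matches)
X?       X, once or not at all
X*       X, zero or more times
X+       X, one or more times
X{n}     X, exactly n times
X{n,}    X, at least n times
X{n,m}   X, at least n but not more than m times
For example, to match <HR > or <HR SIZE = 14 >, write <HR( +SIZE = 14)? >; to match <H1> through <H6>, write <H[1-6]> (adding + after the class, as in <H[1-6]+>, would also match tags such as <H12>).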
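The quantifiers can be exercised with a minimal sketch. QuantifierDemo and whole are names made up here, and the colou?r and a{...} patterns are extra illustrations not taken from the notes. In the HR pattern the spaces are placed inside the optional group so that <HR > itself is matched.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for these notes.
class QuantifierDemo {
    // Pattern.matches requires the whole input to match the regex.
    static boolean whole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"colou?r", "color"},              // ? : u is optional
            {"ab*", "a"},                      // * : zero b's is fine
            {"ab+", "a"},                      // + : needs at least one b
            {"a{3}", "aaa"},                   // exactly three
            {"a{2,}", "a"},                    // at least two
            {"a{2,4}", "aaaaa"},               // between two and four
            {"<HR( +SIZE = 14)? >", "<HR >"},
            {"<HR( +SIZE = 14)? >", "<HR SIZE = 14 >"},
            {"<H[1-6]>", "<H12>"},
            {"<H[1-6]+>", "<H12>"},
        };
        for (String[] c : cases) {
            System.out.println(c[0] + " on \"" + c[1] + "\": " + whole(c[0], c[1]));
        }
    }
}
```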
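Note: X{1,} is equivalent to X+.
1.8 Subexpressions
Relative to a complete expression, a part enclosed in () is called a subexpression. For example, in ^(Subject|Date):, Subject|Date is a subexpression. A part enclosed in [] is not a subexpression: in H[1-6], 1-6 is not a subexpression.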
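A few examples. To match amounts of money such as $5, $44, and $5.49, write \$[0-9]+(\.[0-9][0-9])?. It breaks down into three parts, \$ and ...+ and (...)?: a literal $ sign (the backslash makes it match a dollar sign rather than act as the end-of-line marker), at least one digit, and an optional decimal point followed by two decimal digits.
To match times such as "9:17 am" and "12:30 pm", a first attempt is [0-9]?[0-9]:[0-9][0-9] (am|pm).
It matches 9:17 am and 12:30 pm, but it also matches 99:99 pm.
That is clearly wrong. The hour has one or two digits; when it has two, the first can only be 1 and the second only 0, 1, or 2, so the hour part can be written (1[012]|[1-9]).
For the minutes, the first digit is 0-5 and the second is 0-9, so the correct expression is (1[012]|[1-9]):[0-5][0-9] (am|pm).
With a two-digit 24-hour clock, the hour part can be written 0?[0-9]|1[0-9]|2[0-3] (wrapped in parentheses when it is part of a larger expression).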
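The money and time expressions above can be compared side by side in a sketch like the following. The class MoneyAndTimeDemo is invented for these notes, and the 24-hour pattern with a :[0-5][0-9] minutes part is an assumption added here, since the notes only give the hour part.

```java
import java.util.regex.Pattern;

// Hypothetical demo class for these notes.
class MoneyAndTimeDemo {
    static final Pattern MONEY = Pattern.compile("\\$[0-9]+(\\.[0-9][0-9])?");
    static final Pattern NAIVE_TIME = Pattern.compile("[0-9]?[0-9]:[0-9][0-9] (am|pm)");
    static final Pattern TIME_12H = Pattern.compile("(1[012]|[1-9]):[0-5][0-9] (am|pm)");
    // Assumption: minutes appended to the 24-hour hour part from the notes.
    static final Pattern TIME_24H = Pattern.compile("(0?[0-9]|1[0-9]|2[0-3]):[0-5][0-9]");

    // True if the whole string matches the pattern.
    static boolean is(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println("$5.49 is money: " + is(MONEY, "$5.49"));
        System.out.println("99:99 pm, naive: " + is(NAIVE_TIME, "99:99 pm"));
        System.out.println("99:99 pm, strict: " + is(TIME_12H, "99:99 pm"));
        System.out.println("12:30 pm, strict: " + is(TIME_12H, "12:30 pm"));
        System.out.println("19:30, 24h: " + is(TIME_24H, "19:30"));
        System.out.println("24:00, 24h: " + is(TIME_24H, "24:00"));
    }
}
```

The parentheses around the hour alternatives matter: without them, the trailing :[0-5][0-9] would attach only to the last alternative.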
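1.9 Backreferences
While matching, the regex engine records the string matched by each expression enclosed in parentheses "( )".
The string matched by a parenthesized expression is available not only after the match has finished but also during matching.
A later part of the expression can refer to the string that an earlier parenthesized subexpression has already matched.
The reference is written as "\" followed by a number.
It searches again for the text matched by an earlier group; for example, \1 stands for the text matched by group 1.
For example, (\b[a-z]+\b) +\1 can be used to match repeated words such as go go or kitty kitty.
First comes a word, that is, one or more letters between a word start and a word end (\b[a-z]+\b), then one or more spaces, and finally the word matched earlier (\1).
"\1" refers to the string matched inside the first pair of parentheses, "\2" to the string matched inside the second pair, and so on. If one pair of parentheses contains another, the outer pair is numbered first.
In other words, pairs are numbered in the order in which their left parentheses "(" appear.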
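Backreferences and group numbering can be sketched in Java as follows. BackreferenceDemo and its methods are hypothetical names, and the date-like pattern in nestedGroups is an invented example used only to show how nested parentheses are numbered.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for these notes.
class BackreferenceDemo {
    // The doubled-word expression from the notes, with backslashes doubled for Java.
    private static final Pattern DOUBLED = Pattern.compile("(\\b[a-z]+\\b) +\\1");

    // Returns the first word that is immediately repeated, or null if there is none.
    static String firstDoubledWord(String text) {
        Matcher m = DOUBLED.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Groups are numbered by the position of their left parenthesis:
    // group 1 is the outer pair, groups 2 and 3 are nested inside it, group 4 comes last.
    static String[] nestedGroups(String text) {
        Matcher m = Pattern.compile("(([0-9]+)-([0-9]+))-([0-9]+)").matcher(text);
        if (!m.matches()) {
            return new String[0];
        }
        return new String[] {m.group(1), m.group(2), m.group(3), m.group(4)};
    }

    public static void main(String[] args) {
        System.out.println(firstDoubledWord("go go home"));
        System.out.println(firstDoubledWord("no repeats here"));
        System.out.println(String.join(" | ", nestedGroups("2002-07-04")));
    }
}
```

One caveat with the expression as written: nothing marks the end of the repeated word, so "the theory" also counts as a doubled "the"; appending \b after \1 avoids that.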
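Good grief, chapter 1 is finally done. These notes mainly give detailed translations of the examples; the descriptive text is not repeated at length.
Chapter 1 is only an introduction; chapter 2 gets considerably harder.
To be continued....
2. Extended applications of regular expressions
This chapter uses several worked examples to show how to apply regular expressions to match and search text.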