DAY36. Python Regular Expressions

LEE_BOMB 2021. 11. 4. 13:57

Regular Expressions

A way of writing string patterns using metacharacters, each of which follows a specific rule

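[Main metacharacters]
.x : any single character followed by x (ex : abc, mbc -> .bc)
^x : string starting with x (prefix extraction)
x$ : string ending with x (suffix extraction)
x. : x followed by any single character (ex : t1, t2, ta -> t.)
x* : x repeated 0 or more times (including absent)
x+ : x repeated 1 or more times
x? : x occurs 0 or 1 time
x{m,n} : x repeated m to n times in a row (no space inside the braces)
x{m,} : x repeated m or more times in a row
x{,n} : x repeated n or fewer times in a row
[x] : matches a single character x (one character from the set in the brackets)
| : OR condition
\ : escape; treats the following metacharacter as an ordinary character
\d : digit
\w : word character (letter, digit, underscore)
\s : whitespace
() : grouping; specifies the part of the pattern to extract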
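A few of the metacharacters listed above, tried on short strings. This is only a minimal sketch: the sample strings and variable names are made up for illustration and are not part of the examples that follow.

```python
import re

# x* : 0 or more, x+ : 1 or more, x? : 0 or 1
star = re.findall('ab*', 'a ab abb')
plus = re.findall('ab+', 'a ab abb')
optional = re.findall('ab?', 'a ab abb')
print(star)      # ['a', 'ab', 'abb']
print(plus)      # ['ab', 'abb']
print(optional)  # ['a', 'ab', 'ab']

# | : OR condition
either = re.findall('cat|dog', 'cat dog bird')
print(either)    # ['cat', 'dog']

# \ : the escaped dot matches a literal '.', not "any character"
dots = re.findall(r'\.', 'a.b.c')
print(dots)      # ['.', '.']

# () : with one group, findall returns only the grouped part
users = re.findall(r'(\w+)@', 'kim@test.com lee@test.com')
print(users)     # ['kim', 'lee']
```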

st1 = '1234 abc홍길동 ABC_555_6 이사도시'
st2 = 'test1abcABC 123mbc 45test'
urls = ['http://news.com/a/test', 'new.com','http://news.com/b/test', 'http//~']
st3 = 'test^홍길동 abc 대한*민국 123$tbc'

Module : a Python file (*.py) containing functions or classes
Install path : C:/Users/KIM YOON/anaconda3/Lib/re.py

The re module (a Python file) provides regular expressions and string processing functions

Syntax) import module

import re # module (re.py) - method 1

Syntax) from module import function1, function2, function3, ...

from re import findall, match, sub # method 2 : recommended

1. findall('pattern', string)
Finds every substring that matches the pattern -> returns a list

1) Finding digits

print(re.findall('1234', st1)) # ['1234'] : method 1
print(findall('1234', st1)) # ['1234'] : method 2
print(findall('[0-9]', st1)) # ['1', '2', '3', '4', '5', '5', '5', '6']
print(findall('[0-9]{3}', st1)) # ['123', '555']
print(findall('[0-9]{3,}', st1)) # ['1234', '555']
print(findall('[0-9]{3,4}', st1)) # ['1234', '555']
print(findall(r'\d{3,4}', st1)) # ['1234', '555']
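The last pattern above is written with an r prefix (a raw string). A short sketch of what the prefix changes, assuming the same st1 as above; the variable names are only for illustration.

```python
from re import findall

st1 = '1234 abc홍길동 ABC_555_6 이사도시'

# In a raw string the backslash is kept as written,
# so r'\d' is the same two characters as '\\d'.
pattern_raw = r'\d{3,4}'
pattern_escaped = '\\d{3,4}'
same = pattern_raw == pattern_escaped
print(same)  # True

# Without the r prefix Python itself interprets some escapes first.
newline_len = len('\n')   # one character: a newline
raw_len = len(r'\n')      # two characters: backslash and n
print(newline_len, raw_len)  # 1 2

found = findall(pattern_raw, st1)
print(found)  # ['1234', '555']
```

Writing every pattern that contains a backslash as a raw string avoids having to remember which escapes Python handles on its own.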

2) Finding strings

findall('[가-힣]{3,}', st1) # ['홍길동', '이사도시']
findall('[a-z]{3}', st1) # ['abc']
findall('[a-zA-Z]{3}', st1) # ['abc', 'ABC'] (| inside [] is a literal character, so it is left out)
findall('[a-z]{4}', st1) # [] : empty list (no match)
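Inside square brackets, | is not an OR operator but an ordinary character, so a class covering both lowercase and uppercase letters is written [a-zA-Z]. A small sketch of the difference, using a made-up sample string that contains a literal |:

```python
from re import findall

sample = 'ab|cd EF'  # hypothetical sample string

# '|' is part of the character class here, so it is matched too
with_pipe = findall('[a-z|A-Z]+', sample)
# only letters are in the class
without_pipe = findall('[a-zA-Z]+', sample)

print(with_pipe)     # ['ab|cd', 'EF']
print(without_pipe)  # ['ab', 'cd', 'EF']
```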

words = st1.split() # create tokens by splitting on whitespace
print(words) # ['1234', 'abc홍길동', 'ABC_555_6', '이사도시']

names = [] # stores Korean names

for word in words:
    result = findall('[가-힣]{3,}', word) # e.g. word = '1234'
    print(result) # [], ['홍길동'], [], ['이사도시']

    if result: # an empty list is False, a non-empty list is True
        #names.append(result) # would build a nested list
        names.extend(result) # builds a flat list

print(names) # ['홍길동', '이사도시'] (with append : [['홍길동'], ['이사도시']])
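The same extraction can also be written more compactly. This is just one possible variant of the loop above, not a replacement for it; names_flat and names_nested are illustrative names.

```python
from re import findall

st1 = '1234 abc홍길동 ABC_555_6 이사도시'

# One flat list: every match from every token
names_flat = [name
              for word in st1.split()
              for name in findall('[가-힣]{3,}', word)]
print(names_flat)  # ['홍길동', '이사도시']

# What append would have produced: one list per matching token
names_nested = []
for word in st1.split():
    result = findall('[가-힣]{3,}', word)
    if result:
        names_nested.append(result)
print(names_nested)  # [['홍길동'], ['이사도시']]
```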

3) Finding prefix/suffix strings

st2 = 'test1abcABC 123mbc 45test'

findall('^test', st2) # ['test']
findall('^text', st2) # []

findall('test$', st2) #  ['test']
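Note that ^ and $ refer to the start and end of the whole string passed to findall, not of each word. To pick out the individual words that start or end with test, one option is to split first and test each token. A sketch, assuming the same st2; the list names are illustrative.

```python
from re import findall

st2 = 'test1abcABC 123mbc 45test'

tokens = st2.split()  # ['test1abcABC', '123mbc', '45test']

# tokens with the prefix 'test'
prefix_words = [w for w in tokens if findall('^test', w)]
# tokens with the suffix 'test'
suffix_words = [w for w in tokens if findall('test$', w)]

print(prefix_words)  # ['test1abcABC']
print(suffix_words)  # ['45test']
```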

Finding abc, mbc : any single character followed by bc

findall('.bc' , st2) # ['abc', 'mbc']
urls = ['http://news.com/a/test', 'new.com','http://news.com/b/test', 'http//~']

print(urls)
#['http://news.com/a/test', 'new.com', 'http://news.com/b/test', 'http//~']

len(urls) # 4

urls_re = [] # stores valid URLs

for url in urls:
    if findall(r'^http://news\.com', url): # a non-empty list is True
        print(url) # print the valid URL
        urls_re.append(url)

http://news.com/a/test
http://news.com/b/test

print(urls_re) # ['http://news.com/a/test', 'http://news.com/b/test']
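In a URL pattern the dot is itself a metacharacter (any single character), so it is safer to escape it as \. when a literal dot is meant. A sketch of the difference; the tricky URL is a made-up value used only to show the effect.

```python
from re import findall

urls = ['http://news.com/a/test', 'new.com', 'http://news.com/b/test', 'http//~']
tricky = 'http://newsXcom/a/test'  # hypothetical URL, not from the list above

# Unescaped '.' matches any character, so 'newsXcom' is accepted
unescaped = findall('^http://news.com', tricky)
# Escaped '\.' matches only a literal dot
escaped = findall(r'^http://news\.com', tricky)
print(unescaped)  # ['http://newsXcom']
print(escaped)    # []

# The same filter as the loop, written as a list comprehension
urls_re = [url for url in urls if findall(r'^http://news\.com', url)]
print(urls_re)    # ['http://news.com/a/test', 'http://news.com/b/test']
```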

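4) Finding words (\w) : word characters (Hangul, Latin letters, digits, underscore), non-word characters (special characters, punctuation, whitespace)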

st3 = 'test^홍길동 abc 대한*민국 123$tbc'

findall(r'\w{3,}', st3) # find words of 3 or more characters
# ['test', '홍길동', 'abc', '123', 'tbc']
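Two details of \w that the example above does not show: the underscore counts as a word character, and for str patterns in Python 3 Hangul counts as well unless the re.ASCII flag is given. A sketch using st1 from the beginning of this post; the variable names are illustrative.

```python
import re

st1 = '1234 abc홍길동 ABC_555_6 이사도시'

# Default: letters of any script, digits and '_' are word characters
unicode_words = re.findall(r'\w{3,}', st1)
print(unicode_words)  # ['1234', 'abc홍길동', 'ABC_555_6', '이사도시']

# With re.ASCII, \w is limited to [a-zA-Z0-9_]
ascii_words = re.findall(r'\w{3,}', st1, flags=re.ASCII)
print(ascii_words)    # ['1234', 'abc', 'ABC_555_6']
```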

5) Excluding characters : [^excluded characters] -> [^t]

findall('[^t]+', st3) #['es', '^홍길동 abc 대한*민국 123$', 'bc']
# one or more consecutive characters other than the excluded character

Excluding special characters : ^ * $

findall('[^^*$]+', st3) #['test', '홍길동 abc 대한', '민국 123', 'tbc']
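In '[^^*$]+' only the first ^ means "exclude"; the second ^, the * and the $ are ordinary characters inside the brackets. When the excluded characters come from a variable, one option is to build the class with re.escape. The helper exclude_chars below is hypothetical, written only for this sketch.

```python
import re

st3 = 'test^홍길동 abc 대한*민국 123$tbc'

def exclude_chars(chars, text):
    """Hypothetical helper: runs of text containing none of chars (chars must not be empty)."""
    # re.escape makes every special character in chars literal
    pattern = '[^' + re.escape(chars) + ']+'
    return re.findall(pattern, text)

no_special = exclude_chars('^*$', st3)
no_t = exclude_chars('t', st3)
print(no_special)  # ['test', '홍길동 abc 대한', '민국 123', 'tbc']
print(no_t)        # ['es', '^홍길동 abc 대한*민국 123$', 'bc']
```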