
DAY36. Python Regular Expressions

LEE_BOMB 2021. 11. 4. 13:57
Regular Expressions

A regular expression is a string that specifies a pattern using metacharacters, each of which follows a specific rule.

 


[Main metacharacters]
.x : any single character followed by x (e.g. abc, mbc -> .bc)
^x : string that starts with x (prefix extraction)
x$ : string that ends with x (suffix extraction)
x. : x followed by any single character (e.g. t1, t2, ta -> t.)
x* : x repeated 0 or more times (including no occurrence)
x+ : x repeated 1 or more times
x? : x occurs 0 or 1 time
x{m,n} : x repeated m to n times in a row (no space inside the braces)
x{m,} : x repeated at least m times in a row
x{,n} : x repeated at most n times in a row
[x] : matches one character from the set x (e.g. [0-9], [a-z])
| : or (alternation)
\ : escapes a metacharacter so that it is treated as a literal character
\d : digit
\w : word character (letter, digit, or underscore)
\s : whitespace
() : grouping; specifies the part of the pattern to extract
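The metacharacters above can be tried out on small strings. The following is a minimal sketch; the sample strings and variable names are made up for illustration and are not from the original notes.

```python
import re

# Anchors: ^ matches at the start of the string, $ at the end.
starts = re.findall(r'^ab', 'abcab')           # ['ab']
ends = re.findall(r'ab$', 'abcab')             # ['ab']

# Quantifiers: * (0 or more), + (1 or more), ? (0 or 1).
star = re.findall(r'ab*', 'a ab abb')          # ['a', 'ab', 'abb']
plus = re.findall(r'ab+', 'a ab abb')          # ['ab', 'abb']
optional = re.findall(r'ab?', 'a ab abb')      # ['a', 'ab', 'ab']

# {m,n} must be written without spaces; with a space it is read as literal text.
braces_ok = re.findall(r'\d{2,3}', '1 22 333')     # ['22', '333']
braces_bad = re.findall(r'\d{2, 3}', '1 22 333')   # []

# Inside [...], | is an ordinary character; outside, it means "or".
in_class = re.findall(r'[a|b]', 'a|b')         # ['a', '|', 'b']
alternation = re.findall(r'a|b', 'a|b')        # ['a', 'b']

# A backslash turns the metacharacter . into a literal dot.
literal_dot = re.findall(r'\.', 'a.b')         # ['.']

print(star, plus, optional)
print(braces_ok, braces_bad)
print(in_class, alternation, literal_dot)
```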


st1 = '1234 abcํ™๊ธธ๋™ ABC_555_6 ์ด์‚ฌ๋„์‹œ'
st2 = 'test1abcABC 123mbc 45test'
urls = ['http://news.com/a/test', 'new.com','http://news.com/b/test', 'http//~']
st3 = 'test^ํ™๊ธธ๋™ abc ๋Œ€ํ•œ*๋ฏผ๊ตญ 123$tbc'


Module : a Python file (*.py) that contains functions or classes
Install path (example) : C:/Users/KIM YOON/anaconda3/Lib/re.py


re : the module (a Python file) that provides regular expression and string-processing functions

Syntax) import module

import re # module (re.py) - method 1

 

Syntax) from module import function1, function2, function3, ...

from re import findall, match, sub # method 2 : recommended

 

 




1. findall('pattern', string)
Finds every substring that matches the pattern -> returns a list

1) Finding digits

print(re.findall('1234', st1)) # ['1234'] : method 1
print(findall('1234', st1)) # ['1234'] : method 2
print(findall('[0-9]', st1)) # ['1', '2', '3', '4', '5', '5', '5', '6']
print(findall('[0-9]{3}', st1)) # ['123', '555']
print(findall('[0-9]{3,}', st1)) # ['1234', '555']
print(findall('[0-9]{3,4}', st1)) # ['1234', '555']
print(findall(r'\d{3,4}', st1)) # ['1234', '555']


2) ๋ฌธ์ž์—ด ์ฐพ๊ธฐ 

findall('[๊ฐ€-ํžฃ]{3,}', st1) # ['ํ™๊ธธ๋™', '์ด์‚ฌ๋„์‹œ']
findall('[a-z]{3}', st1) #  ['abc']
findall('[a-z|A-Z]{3}', st1) # ['abc', 'ABC']
findall('[a-z]{4}', st1) # [] null๊ฐ’
words = st1.split() # ๊ณต๋ฐฑ ๊ธฐ์ค€ ํ† ํฐ ์ƒ์„ฑ 
print(words) # ['1234', 'abcํ™๊ธธ๋™', 'ABC_555_6', '์ด์‚ฌ๋„์‹œ']
names = [] # ํ•œ๊ธ€ ์ด๋ฆ„ ์ €์žฅ
for word in words :
    result = findall('[๊ฐ€-ํžฃ]{3,}', word) # '1234'
    print(result) # [], ['ํ™๊ธธ๋™']
    
    if result : # False(null) or True(not null)
        #names.append(result) # ์ค‘์ฒฉ list
        names.extend(result) # ๋‹จ์ผ list
print(names) # [['ํ™๊ธธ๋™'], ['์ด์‚ฌ๋„์‹œ']] -> ['ํ™๊ธธ๋™', '์ด์‚ฌ๋„์‹œ']



3) Finding prefix/suffix strings

st2 = 'test1abcABC 123mbc 45test'

findall('^test', st2) # ['test']
findall('^text', st2) # []

findall('test$', st2) #  ['test']


abc, mbc : any single character followed by bc

findall('.bc' , st2) # ['abc', 'mbc']
urls = ['http://news.com/a/test', 'new.com','http://news.com/b/test', 'http//~']

print(urls)
#['http://news.com/a/test', 'new.com', 'http://news.com/b/test', 'http//~']

len(urls) # 4
urls_re = [] # stores valid urls
for url in urls :
    if findall(r'^http://news\.com', url) : # non-empty list == True ('.' escaped so it matches a literal dot)
        print(url) # print the valid url
        urls_re.append(url)


http://news.com/a/test
http://news.com/b/test

print(urls_re) # ['http://news.com/a/test', 'http://news.com/b/test']
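Since match (already imported above) only matches at the start of the string, the same filter can be written without ^. Note also that an unescaped . in a pattern matches any character, so escaping it is safer for host names. A minimal sketch; the string 'http://newsXcom/a' is a made-up example:

```python
from re import findall, match

urls = ['http://news.com/a/test', 'new.com', 'http://news.com/b/test', 'http//~']

# match() anchors at the beginning, so no ^ is needed.
urls_re = [url for url in urls if match(r'http://news\.com', url)]
print(urls_re)   # ['http://news.com/a/test', 'http://news.com/b/test']

# An unescaped dot also accepts other characters in that position.
loose = findall('^http://news.com', 'http://newsXcom/a')
strict = findall(r'^http://news\.com', 'http://newsXcom/a')
print(loose)     # ['http://newsXcom']
print(strict)    # []
```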



4) Finding words (\w) : word characters (Korean, English letters, digits, underscore), non-word characters (special characters, punctuation, whitespace)

st3 = 'test^홍길동 abc 대한*민국 123$tbc'

findall(r'\w{3,}', st3) # find words of 3 or more characters
# ['test', '홍길동', 'abc', '123', 'tbc']

 

 

5) Excluding characters : [^excluded characters] -> [^t]

findall('[^t]+', st3) # ['es', '^홍길동 abc 대한*민국 123$', 'bc']
# one or more consecutive characters other than the excluded one


Excluding special characters : ^ * $

findall('[^^*$]+', st3) # ['test', '홍길동 abc 대한', '민국 123', 'tbc']