๋ฐ์ดํ„ฐ๋ถ„์„๊ฐ€ ๊ณผ์ •/Python

DAY56. Python Text Mining (1) Web Crawling (url, tag, html)

LEE_BOMB 2021. 12. 9. 19:57

The 4 Steps of Text Mining
1. Document collection (Text Crawling)
2. Morphological analysis (KoNLPy)
3. Visualization (Word Cloud)
4. Sparse matrix (Sparse Matrix)



1. Document Collection
1) HTML web documents
2) URL-based collection
3) Query-based collection



HTML (Hypertext Markup Language) Web Documents
The web language used to write documents viewed through the World Wide Web (WWW).
A web document is composed of a variety of tags.

① Tag format : <start tag> content </end tag>
Tag inspection : web browsers can inspect and search the tags of a web document (shortcut : F12)

② Tag attribute : assigns additional behavior to a tag
Format) <tag attribute="value"> content </end tag>

③ Selector : attributes used to target specific tags, e.g. when applying styles
Format) <tag id="idName"> content </end tag>
<tag class="className"> content </end tag>
id attribute : must be unique in a document; class attribute : may be shared by several tags (see the sketch below)




url request

<Workflow>
1. Request the url -> response (source)
2. Parse the source into an html document
3. Search tags -> collect the content

from urllib.request import urlopen #url request
from bs4 import BeautifulSoup #html parsing


The url to request

url = "https://www.naver.com/index.html"




1. Request the url from the remote server

req = urlopen(url) # request -> response (source)
source = req.read() # read the source
print(source)




2. Decoding & html parsing

data = source.decode("utf-8") # charset="utf-8"
html = BeautifulSoup(data, 'html.parser')
print(html)




3. Search tags -> collect the content

a = html.find('a') # find('tag')
print(a) # collects the first <a> tag found

<a href="#newsstand"><span>๋‰ด์Šค์Šคํƒ ๋“œ ๋ฐ”๋กœ๊ฐ€๊ธฐ</span></a>
<ํƒœ๊ทธ ์†์„ฑ>๋‚ด์šฉ </ํƒœ๊ทธ> -> tag element

print('a tag content : ', a.text) #a tag content : ๋‰ด์Šค์Šคํƒ ๋“œ ๋ฐ”๋กœ๊ฐ€๊ธฐ

aa = html.find_all('a') # collect every <a> tag - list
print(aa)
len(aa) # 406
aa[-1] #<a data-clk="nhn" href="https://www.navercorp.com" target="_blank">ⓒ NAVER Corp.</a>
aa[-1].text # 'ⓒ NAVER Corp.'
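Note: some portals reject requests that arrive with urllib's default user agent, so urlopen() may return an error page instead of the real source. If that happens, sending a browser-like User-Agent header usually helps; a hedged sketch (the header string is only an example):

from urllib.request import urlopen, Request

req = Request(url, headers={'User-Agent': 'Mozilla/5.0'}) # pretend to be a browser
source = urlopen(req).read()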





tag find

1. html.find('tag') : collects the first matching tag
2. html.find_all('tag') : collects every matching tag (see the sketch below)
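The difference is easiest to see on a tiny fragment; a minimal sketch, with the fragment invented for illustration:

from bs4 import BeautifulSoup

html = BeautifulSoup('<p>one</p><p>two</p>', 'html.parser')
print(html.find('p'))     # <p>one</p> - first match only
print(html.find_all('p')) # [<p>one</p>, <p>two</p>] - every match, as a list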


from bs4 import BeautifulSoup #html parsing


1. Read a local html file

path = 'C:\\ITWILL\\4_Python-2\\workspace\\chap10_TextMining\\data'
file = open(path + '/html01.html', mode='r', encoding='utf-8')
src = file.read() # no decoding needed
file.close()




2. html parsing

html = BeautifulSoup(src, 'html.parser')
print(html)




3. Get the tag content
1) find('tag') : find the first tag

h1 = html.find('h1')
h1 # <h1> ์‹œ๋ฉ˜ํ‹ฑ ํƒœ๊ทธ ?</h1>


Tag content : string, text

h1.string #' ์‹œ๋ฉ˜ํ‹ฑ ํƒœ๊ทธ ?'
h1.text #' ์‹œ๋ฉ˜ํ‹ฑ ํƒœ๊ทธ ?'


string vs text

h2 = html.find('h2')
h2 #<h2> ์ฃผ์š” ์‹œ๋ฉ˜ํ‹ฑ ํƒœ๊ทธ <span> span ํƒœ๊ทธ </span> </h2>
print(h2.string) #None
print(h2.text) #์ฃผ์š” ์‹œ๋ฉ˜ํ‹ฑ ํƒœ๊ทธ span ํƒœ๊ทธ

string : returns None when the tag contains a child tag
text : returns the content of child tags as well (a self-contained sketch follows)
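A self-contained sketch of the same behavior, with the fragment invented for illustration:

from bs4 import BeautifulSoup

h2 = BeautifulSoup('<h2>outer <span>inner</span></h2>', 'html.parser').h2
print(h2.string) # None - a child tag is present, so .string gives up
print(h2.text)   # 'outer inner' - .text joins all descendant text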


2) find_all('tag') : find every matching tag - returns a list

lis = html.find_all('li')
print(lis)
len(lis) #5


list ๋‚ดํฌ : # li ๋‚ด์šฉ ์ €์žฅ

contents = [li.text for li in lis]
print(contents)





tag attr

tag ์†์„ฑ๊ณผ ๋‚ด์šฉ ๊ฐ€์ ธ์˜ค๊ธฐ
tag element : tag + ์†์„ฑ + ๋‚ด์šฉ
ex) <a href="www.naver.com"> ๋„ค์ด๋ฒ„ </a>
<์‹œ์ž‘ํƒœ๊ทธ ์†์„ฑ="๊ฐ’"> ๋‚ด์šฉ </์ข…๋ฃŒํƒœ๊ทธ>
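In BeautifulSoup, an element's attributes are exposed as a dict named attrs; a minimal sketch reusing the <a> element above:

from bs4 import BeautifulSoup

a = BeautifulSoup('<a href="www.naver.com"> ๋„ค์ด๋ฒ„ </a>', 'html.parser').a
print(a.attrs)         # {'href': 'www.naver.com'} - attribute dict
print(a.attrs['href']) # www.naver.com
print(a.text)          # ' ๋„ค์ด๋ฒ„ '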


from bs4 import BeautifulSoup #html parsing



1. Read a local html file

path = 'C:\\ITWILL\\4_Python-2\\workspace\\chap10_TextMining\\data'
file = open(path + '/html02.html', mode='r', encoding='utf-8')
src = file.read() # no decoding needed
file.close()




2. html parsing

html = BeautifulSoup(src, 'html.parser')
print(html)




3. ํƒœ๊ทธ ์†์„ฑ๊ณผ ๋‚ด์šฉ ๊ฐ€์ ธ์˜ค๊ธฐ

links = html.find_all('a')
print(links) # list returned
len(links) #5
#<a href="www.naver.com">๋„ค์ด๋ฒ„</a>


Printing the content

for link in links :
    print(link.text)

๋„ค์ด๋ฒ„
๋„ค์ด๋ฒ„
๋„ค์ด๋ฒ„ ์ƒˆ์ฐฝ์œผ๋กœ
๋‹ค์Œ
๋‹ค์Œ

Printing the attributes

urls = [] # store the urls
for link in links :
    # exception handling : not every tag has a target attribute
    try :
        #print(link.attrs) #{'href': 'www.naver.com'} - dict
        print(link.attrs['href']) # extract the url
        urls.append(link.attrs['href']) # store the url
        print(link.attrs['target']) # target attribute value
    except Exception as e :
        print('exception : ', e)

www.naver.com
http://www.naver.com
http://www.naver.com
www.duam.net
http://www.duam.net

print(urls)

['www.naver.com',
'http://www.naver.com',
'http://www.naver.com',
'www.duam.net',
'http://www.duam.net']
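A side note: Tag.get() returns None instead of raising KeyError when an attribute is missing, so the href values above can also be collected without try/except; a hedged sketch, assuming every <a> carries an href as in this output:

urls = [link.get('href') for link in links if link.get('href') is not None]
print(urls) # same five urls as above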



4. Select valid urls with a regular expression

import re

new_urls = []
for url in urls :
    result = re.findall('^http://', url)
    #print(result)
    if result : # empty list == False
        new_urls.append(url)

print(new_urls)
#['http://www.naver.com', 'http://www.naver.com', 'http://www.duam.net']
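One caveat: the pattern '^http://' silently drops https:// links, which most sites use today. A variant that accepts both schemes (re.match already anchors at the start of the string, so ^ is optional):

import re

new_urls = [url for url in urls if re.match(r'https?://', url)]
print(new_urls) #['http://www.naver.com', 'http://www.naver.com', 'http://www.duam.net']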