DAY56. Python Text Mining (1) Web Crawling (url, tag, html)
Text Mining: 4 Steps
1. Document collection (Text Crawling)
2. Morphological analysis (KoNLPy)
3. Visualization (Word Cloud)
4. Sparse matrix

1. Document Collection
1) HTML web documents
2) URL-based data collection
3) Query-based collection
HTML (Hypertext Markup Language) Web Documents
A markup language used to create documents that can be viewed over the World Wide Web (www).
Web documents are written using a variety of tags.

① Tag format : <start tag> content </end tag>
Tag inspection : a web browser can inspect and search the tags of a web document (shortcut key : F12)

② Tag attribute : specifies the behavior of the tag
Format) <tag attribute> content </end tag>

③ Selector : an attribute used to apply styling to a specific tag
Format) <tag id="idName"> content </end tag>
        <tag class="className"> content </end tag>
id attribute : must be unique (no duplicates); class attribute : may be reused (see the sketch below)
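The later examples search only by tag name, so here is a minimal sketch (not from the course material; the HTML snippet is made up for illustration) of how BeautifulSoup can search by the id and class selectors described above:

from bs4 import BeautifulSoup

# hypothetical HTML snippet illustrating id and class selectors
sample = """
<div id="header">Header area</div>
<p class="item">first item</p>
<p class="item">second item</p>
"""

html = BeautifulSoup(sample, 'html.parser')

# id is unique -> searching by id returns a single tag
print(html.find(id='header').text)           # Header area

# class may be reused -> find_all() by class returns every match
for p in html.find_all('p', class_='item'):
    print(p.text)                            # first item / second item

# the same searches written as CSS selectors
print(html.select_one('#header').text)       # Header area
print([p.text for p in html.select('p.item')])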
url request
<Work procedure>
1. Request a URL -> response (source)
2. source -> HTML parsing (convert to an HTML document object)
3. Search tags -> collect the content
from urllib.request import urlopen   # URL request
from bs4 import BeautifulSoup        # HTML parsing
URL request
url = "https://www.naver.com/index.html"
1. Request the remote server URL

req = urlopen(url)    # request -> response (source)
source = req.read()   # read the source
print(source)
2. Decoding & HTML parsing

data = source.decode("utf-8")               # charset="utf-8"
html = BeautifulSoup(data, 'html.parser')
print(html)
3. Search tags -> collect the content

a = html.find('a')   # find('tag')
print(a)             # the first <a> tag found

<a href="#newsstand"><span>뉴스스탠드 바로가기</span></a>

<tag attribute> content </tag> -> tag element

print('a tag content : ', a.text)   # a tag content :  뉴스스탠드 바로가기

aa = html.find_all('a')   # collect every <a> tag - list
print(aa)
len(aa)       # 406
aa[-1]        # <a data-clk="nhn" href="https://www.navercorp.com" target="_blank">ⓒ NAVER Corp.</a>
aa[-1].text   # 'ⓒ NAVER Corp.'
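Note: a bare urlopen() call is rejected by some sites. As an aside (not part of the original example), urllib.request.Request can attach a browser-style User-Agent header before the request is sent; a minimal sketch:

from urllib.request import urlopen, Request

url = "https://www.naver.com/index.html"
# 'Mozilla/5.0' is a placeholder browser identifier, not a value required by the site
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
source = urlopen(req).read()   # request -> response (source), same as before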
tag find
1. html.find('tag') : collects the first matching tag
2. html.find_all('tag') : collects every matching tag
from bs4 import BeautifulSoup   # HTML parsing
1. Read a file from the local machine

path = 'C:\\ITWILL\\4_Python-2\\workspace\\chap10_TextMining\\data'
file = open(path + '/html01.html', mode='r', encoding='utf-8')
src = file.read()   # no decoding needed
file.close()
2. HTML parsing

html = BeautifulSoup(src, 'html.parser')
print(html)
3. Get the tag content

1) find('tag') : find the first matching tag

h1 = html.find('h1')
h1   # <h1> 시맨틱 태그 ?</h1>

Tag content : string, text

h1.string   # ' 시맨틱 태그 ?'
h1.text     # ' 시맨틱 태그 ?'
string vs text
h2 = html.find('h2')
h2                 # <h2> 주요 시맨틱 태그 <span> span 태그 </span> </h2>
print(h2.string)   # None
print(h2.text)     # 주요 시맨틱 태그  span 태그

string : returns None when the tag contains child tags
text : also returns the content of any child tags
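When only the text matters, BeautifulSoup also provides get_text() and stripped_strings; a minimal sketch, assuming the same h2 element as above:

print(h2.get_text(' ', strip=True))   # child text joined with a space, whitespace trimmed
print(list(h2.stripped_strings))      # the text pieces as a list, e.g. ['주요 시맨틱 태그', 'span 태그']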
2) find_all('tag') : find every matching tag - returns a list

lis = html.find_all('li')
print(lis)
len(lis)   # 5
List comprehension : store the content of each li

contents = [li.text for li in lis]
print(contents)
tag attr
Getting tag attributes and content

tag element : tag + attribute + content
ex) <a href="www.naver.com"> 네이버 </a>
<start tag attribute="value"> content </end tag>
from bs4 import BeautifulSoup   # HTML parsing
1. Read a file from the local machine

path = 'C:\\ITWILL\\4_Python-2\\workspace\\chap10_TextMining\\data'
file = open(path + '/html02.html', mode='r', encoding='utf-8')
src = file.read()   # no decoding needed
file.close()
2. HTML parsing

html = BeautifulSoup(src, 'html.parser')
print(html)
3. Get the tag attributes and content

links = html.find_all('a')
print(links)   # returns a list
len(links)     # 5
# <a href="www.naver.com">네이버</a>

Print the content

for link in links :
    print(link.text)

네이버
네이버
네이버 새창으로
다음
다음

Print the attributes

urls = []   # store the urls
for link in links :
    try :   # exception handling
        #print(link.attrs)               # {'href': 'www.naver.com'} - dict
        print(link.attrs['href'])        # extract the url
        urls.append(link.attrs['href'])  # store the url
        print(link.attrs['target'])      # target attribute value
    except Exception as e :
        print('exception : ', e)

www.naver.com
http://www.naver.com
http://www.naver.com
www.duam.net
http://www.duam.net
print(urls)
['www.naver.com',
'http://www.naver.com',
'http://www.naver.com',
'www.duam.net',
'http://www.duam.net']
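Instead of indexing link.attrs and catching the KeyError, an attribute can be read with Tag.get(), which returns None (or a default) when the attribute is missing; a minimal sketch over the same links list:

urls2 = []   # same result as urls above, without try/except
for link in links :
    href = link.get('href')   # None when the <a> tag has no href attribute
    if href :
        urls2.append(href)
    print(link.get('target', '(no target attribute)'))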
4. Keep only well-formed URLs with a regular expression

import re

new_urls = []
for url in urls :
    result = re.findall('^http://', url)
    #print(result)
    if result :   # an empty list is treated as False
        new_urls.append(url)

print(new_urls)   # ['http://www.naver.com', 'http://www.naver.com', 'http://www.duam.net']
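As an extension of the exercise (an assumption, not part of the course code), the filter could also accept https:// links and prepend a scheme to bare domains such as 'www.naver.com' instead of discarding them:

import re

normalized = []
for url in urls :
    if re.match('^https?://', url) :   # already carries a scheme
        normalized.append(url)
    else :                             # assume a bare domain and add one
        normalized.append('http://' + url)

print(normalized)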