Selenium

Selenium ์ด๋ž€?
์›๊ฒฉ์œผ๋กœ ํŠน์ • ์›นํŽ˜์ด์ง€์˜ ๋ฒ„ํŠผ ํด๋ฆญ, ์ž…๋ ฅ์ƒ์ž์—์„œ ์ž๋ฃŒ ์ž…๋ ฅ ๋“ฑ์œผ๋กœ ์–ด๋–ค ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๋Š”์ง€ ๋“ฑ์˜ ๋‹ค์–‘ํ•œ ์›น ํŽ˜์ด์ง€์™€ ์‚ฌ์šฉ์ž ๊ฐ„์˜ ์ƒํ˜ธ์ž‘์šฉ์„ ๋™์ ์œผ ๋กœ ์ œ์–ดํ•˜๋Š” ๊ธฐ์ˆ  ๋˜๋Š” ํ”„๋กœ๊ทธ๋žจ
์šฉ๋„ : ๋™์  ์›นํŽ˜์ด์ง€ ์ž๋ฃŒ ์ˆ˜์ง‘(์˜ํ™” ๋ฆฌ๋ทฐ), ๊ตฌ๊ธ€ ์ด๋ฏธ์ง€ ์ˆ˜์ง‘(์ ˆ์ฐจ)


๋™์  ํŽ˜์ด์ง€ vs ์ •์  ํŽ˜์ด์ง€
1) ์ •์  ํŽ˜์ด์ง€
- ์ด๋ฏธ ์ค€๋น„๋˜์–ด ์žˆ๋Š” ์›น๋ฌธ์„œ๋ฅผ ์‚ฌ์šฉ์ž(client)์—๊ฒŒ ์ œ๊ณต
- ์–ธ์ œ ์ ‘์†ํ•ด๋„ ๋™์ผํ•œ ๋ฆฌ์†Œ์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š” ์›น์‚ฌ์ดํŠธ
- BeautifulSoup ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์›น๋ฌธ์„œ ์ˆ˜์ง‘

2) ๋™์  ํŽ˜์ด์ง€
- ์‚ฌ์šฉ์ž(client)์˜ ์š”์ฒญ์„ ๋ฐ›์€ ์‹œ์ ์—์„œ ์›น๋ฌธ์„œ๋ฅผ ์‚ฌ์šฉ์ž์—๊ฒŒ ์ œ๊ณต
- ์‚ฌ์šฉ์ž์˜ ์š”์ฒญ์— ๋”ฐ๋ผ์„œ ์„œ๋กœ ๋‹ค๋ฅธ ๋ฆฌ์†Œ์Šค๋ฅผ ์ œ๊ณตํ•˜๋Š” ์›น์‚ฌ์ดํŠธ
- Selenium ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ๋ฅผ ์ด์šฉํ•˜์—ฌ ์›น๋ฌธ์„œ ๋ฐ ์ด๋ฏธ์ง€ ์ˆ˜์ง‘
ex) google ์ด๋ฏธ์ง€, SNS(Instagram, Youtube, Facebook) ์ž๋ฃŒ ์ˆ˜์ง‘


Installing Selenium
(base) conda activate tensorflow
(tensorflow) pip install selenium


Web Driver
A program that takes instructions from Selenium and controls dynamic pages through a web browser

Downloading the Chrome driver (chromedriver.exe)
Step 1: check the version of your Chrome browser (the driver version must match it)
Step 2: download the matching Chrome driver from https://chromedriver.chromium.org/downloads


Element
A tag that makes up a web document
Format: <opening-tag attributes> content </closing-tag>
Example: <a href='http://www.naver.com' class='a_link'> Naver </a>


Selenium Crawling
Selenium functions for collecting elements
find_element_by_class_name('class name')     # find by the class attribute
find_element_by_id('id name')                # find by the id attribute
find_element_by_name('name value')           # find by the name attribute
find_element_by_tag_name('tag name')         # find by the tag name
find_element_by_link_text('text')            # find by the text of an a tag
find_element_by_css_selector('css_selector') # find by a selector (. or #)
find_element_by_xpath('xpath')               # find by an absolute or relative xpath

1) ๋ฒ„ํŠผ ํด๋ฆญํ•˜๊ธฐ ์˜ˆ
browser.get("https://naver.com") # url ์ด๋™
elem = browser.find_element_by_class_name("link_login") # ๋ฒ„ํŠผ element
elem.click() # ๋ฒ„ํŠผ์„ ๋ˆ„๋ฆ„
browser.back() # ํŽ˜์ด์ง€ ๋’ค๋กœ ์ด๋™
browser.forward() # ํŽ˜์ด์ง€ ์•ž์œผ๋กœ ์ด๋™
browser.refresh() # ํŽ˜์ด์ง€ ์ƒˆ๋กœ๊ณ ์นจ(F5)

2) Example: entering a keyword and searching
1. target url

driver.get("https://www.google.com/") # move to the Google page


2. name ์†์„ฑ์œผ๋กœ element ๊ฐ€์ ธ์˜ค๊ธฐ

elem = driver.find_element_by_name("q") # collect 1 element


3. type the keyword -> Enter

elem.send_keys("์…€๋ ˆ๋ฆฌ์›€ ํฌ๋กค๋ง")  # type the keyword
elem.send_keys(Keys.ENTER)          # move to the search-result page


3) Selenium์ด์šฉ ์…€๋Ÿฝ ์ด๋ฏธ์ง€ ์ˆ˜์ง‘
1. Google ์ด๋ฏธ์ง€ ๊ฒ€์ƒ‰ ํŽ˜์ด์ง€ ์ ‘์†

# ํฌ๋กฌ ๋“œ๋ผ์ด๋ฒ„ ์ƒ์„ฑ Driver = webdriver.Chrome() #google ์ด๋ฏธ์ง€๊ฒ€์ƒ‰ url ์ ‘์† driver.get("https://www.google.co.kr/imghp?h1=ko&tab=wi&ogb1")


2. find the Google search input box, type the keyword, and press Enter

# ๊ฒ€์ƒ‰์–ด ์ž…๋ ฅ์ƒ์ž : name์†์„ฑ element ์ฐพ๊ธฐ elem = driver.find_element_by_name("q") # ๊ฒ€์ƒ‰์–ด ์ž…๋ ฅ ๋ฐ ์—”ํ„ฐ elem.send_keys('ํ•˜์ •์šฐ') # ๊ฒ€์ƒ‰์–ด ์ž…๋ ฅ elem.send_keys(Keys.RETURN) # ์—”ํ„ฐํ‚ค๋ˆ„๋ฆ„


3. ์ž‘์€ ์ด๋ฏธ์ง€ ์ „์ฒด element ์ˆ˜์ง‘

# class์ด๋ฆ„์œผ๋กœ element ์ฐพ๊ธฐ images = driver.find_elements_by_class_name("rg_i.Q4LuWd")


4. ์ž‘์€ ์ด๋ฏธ์ง€ ํด๋ฆญ -> ํฐ ์ด๋ฏธ์ง€ save

# ์ž‘์€ ์ด๋ฏธ์ง€ ํด๋ฆญ -> ํฐ ์ด๋ฏธ์ง€ save for image in images : image.click() # ์ž‘์€ ์ด๋ฏธ์ง€ ํด๋ฆญ -> ํฐ ์ด๋ฏธ์ง€ ๋‚˜ํƒ€๋‚จ # ํฐ ์ด๋ฏธ์ง€ url ํš๋“ : copy full Xpath ๋‹จ์ถ• ๋ฉ”๋‰ด ์ด์šฉ imageUrl = driver.find_element_by_xpath("/html/body/div[2]/c-wiz/div[3]/div[2]/div[3]/div/div/div[3]/div[2]/c-wiz/div/div[1]/div[1]/div[2]/div/a/img").get_attribute("src") # ํ˜„์žฌ ํด๋”์œ„์น˜์— image ์ €์žฅ urlretrieve(imageUrl, 'image'+str(cnt)+".jpg")





button click

1. move to the Naver page
2. click the login button
3. switch screens

from selenium import webdriver  # module
import time                     # to pause the screen



1. driver ๊ฐ์ฒด ์ƒ์„ฑ

path = r"C:\ITWILL\5_Tensorflow\workspace"
driver = webdriver.Chrome(path + '/chromedriver.exe')
dir(driver)

'find_element'  : collects 1 element
'find_elements' : collects n elements
'get'           : moves to a given url
'forward'       : goes forward one page
'back'          : goes back one page



2. ๋Œ€์ƒ url ์ด๋™

driver.get('https://www.naver.com/')  # move to the url




3. ๋กœ๊ทธ์ธ ๋ฒ„ํŠผ element ๊ฐ€์ ธ์˜ค๊ธฐ
copy element : <a href="https://nid.naver.com/nidlogin.login?mode=form&amp;url=https%3A%2F%2Fwww.naver.com" class="link_login" data-clk="log_off.login"><i class="ico_naver"><span class="blind">๋„ค์ด๋ฒ„</span></i>๋กœ๊ทธ์ธ</a>

1) get it by class name
login_ele = driver.find_element_by_class_name("link_login")
login_ele.click()  # click the button
time.sleep(2)      # pause for 2 seconds

2) get it by xpath
copy xpath : relative path - //*[@id="account"]/a
copy full xpath : absolute path - /html/body/div[2]/div[3]/div[3]/div/div[2]/a

relative path : a short path from a nearby anchor tag down to the current tag
absolute path : the full path from the root tag down to the current tag

login_ele2 = driver.find_element_by_xpath('//*[@id="account"]/a')
login_ele2.click()  # click the button
time.sleep(2)       # pause for 2 seconds
driver.back()       # current page -> previous page
time.sleep(2)       # pause for 2 seconds
driver.forward()    # previous page -> forward
driver.refresh()    # refresh the page (F5)
time.sleep(2)       # pause for 2 seconds
driver.close()      # close the current window
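The relative vs. absolute distinction above can be tried without a browser: the standard library's ElementTree supports a small XPath subset. A sketch on a toy document (the div/a structure is invented for illustration):

```python
# Relative vs. absolute-style XPath on a tiny document, using the
# standard library's ElementTree (which supports a subset of XPath).
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<html><body><div id="account"><a href="/login">login</a></div></body></html>'
)

# relative path: search anywhere below the current node, anchored on the id
rel = doc.find(".//div[@id='account']/a")

# absolute-style path: walk down from the root, step by step
abs_ = doc.find('body/div/a')

print(rel.text, abs_.text)  # login login
```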





text input

์ž…๋ ฅ์ƒ์ž -> ๊ฒ€์ƒ‰์–ด ์ž…๋ ฅ -> [๊ฒ€์ƒ‰ ํŽ˜์ด์ง€ ์ด๋™] -> element ์ˆ˜์ง‘

from selenium import webdriver                   # create the driver
from selenium.webdriver.common.keys import Keys  # acts as the Enter key



def keyword_search(keyword):
    # 1. create the driver object
    path = r"C:\ITWILL\5_Tensorflow\workspace"
    driver = webdriver.Chrome(path + '/chromedriver.exe')

    # 2. move to the target url
    driver.get('https://www.google.com/')  # move to the url

    # 3. search input box : get it by the name attribute
    '''
    <input class="gLFyf gsfi" maxlength="2048" name="q" type="text"
           aria-autocomplete="both" aria-haspopup="false" autocapitalize="off"
           autocomplete="off" autocorrect="off" autofocus="" role="combobox"
           spellcheck="false" title="๊ฒ€์ƒ‰" value="" aria-label="๊ฒ€์ƒ‰"
           data-ved="0ahUKEwiJqbLLxuf0AhU4slYBHXMEBNQQ39UDCAY">
    '''
    input_ele = driver.find_element_by_name('q')  # 1 element

    # 4. type the keyword -> Enter
    input_ele.send_keys(keyword)
    input_ele.send_keys(Keys.ENTER)  # press Enter -> move to the result page

    # 5. collect elements from the result page : get them by tag name
    a_elems = driver.find_elements_by_tag_name('a')  # n elements : returns a list

    # 6. collect the element attribute (href) : url
    urls = []  # url storage
    for a in a_elems:
        url = a.get_attribute("href")  # extract the href attribute value
        urls.append(url)

    # 7. collect the element content
    conts = []
    for a in a_elems:
        conts.append(a.text)

    driver.close()  # close the window
    return urls, conts


keyword = input('enter a search keyword : ')
urls, conts = keyword_search(keyword)
print(urls)
print(conts)











movie review crawling

naver ์˜ํ™” review ํ…์ŠคํŠธ ์ˆ˜์ง‘
find_element_by : 1๊ฐœ element ์ˆ˜์ง‘
find_elements_by : n๊ฐœ element ์ˆ˜์ง‘ - list ๋ฐ˜ํ™˜

from selenium import webdriver  # module
import time                     # to pause the screen



1. driver ๊ฐ์ฒด ์ƒ์„ฑ

path = r"C:\ITWILL\5_Tensorflow\workspace"
driver = webdriver.Chrome(path + '/chromedriver.exe')




2. ๋Œ€์ƒ url ์ด๋™

driver.get('https://movie.naver.com/') #naver ์˜ํ™” ๊ฒ€์ƒ‰ url ์ด๋™




3. [ํ‰์ .๋ฆฌ๋ทฐ] ๋งํฌ ํด๋ฆญ : ์ ˆ๋Œ€๊ฒฝ๋กœ ์ด์šฉ 1๊ฐœ element ๊ฐ€์ ธ์˜ค๊ธฐ
<a href="/movie/point/af/list.naver" title="ํ‰์ ยท๋ฆฌ๋ทฐ" class="menu07"><strong>ํ‰์ ยท๋ฆฌ๋ทฐ</strong></a>

a_ele = driver.find_element_by_xpath('/html/body/div/div[3]/div/div[1]/div/div/ul/li[4]/a') a_ele.click() #a tag ํด๋ฆญ -> ํŽ˜์ด์ง€ ์ด๋™ print(driver.current_url) #ํ˜„์žฌ ํŽ˜์ด์ง€ url ์ถœ๋ ฅ

https://movie.naver.com/movie/point/af/list.naver -> base url
https://movie.naver.com/movie/point/af/list.naver?&page=1 -> query : base?&page={n}
https://movie.naver.com/movie/point/af/list.naver?&page=2
https://movie.naver.com/movie/point/af/list.naver?&page=3
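The base?&page={n} pattern above can be generated with an f-string before handing each url to driver.get(); a small sketch:

```python
# Build the paged urls from the base url and a page number,
# following the pattern base?&page={n} shown above.
base = "https://movie.naver.com/movie/point/af/list.naver"

page_urls = [f"{base}?&page={n}" for n in range(1, 4)]
for u in page_urls:
    print(u)
```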



4. collect the movie title, rating, and review : 20 pages (10 items per page)

title_txt = []  # movie titles
star_txt = []   # ratings
cont_txt = []   # reviews

for n in range(1, 21):  # collect 20 pages
    url = f"https://movie.naver.com/movie/point/af/list.naver?&page={n}"
    driver.get(url)  # move to page n
    time.sleep(1)    # pause for 1 second

    # 1) movie title : copy xpath
    '''
    //*[@id="old_content"]/table/tbody/tr[1]/td[2]/a[1]  - no.1
    //*[@id="old_content"]/table/tbody/tr[2]/td[2]/a[1]  - no.2
    //*[@id="old_content"]/table/tbody/tr[10]/td[2]/a[1] - no.10
    //*[@id="old_content"]/table/tbody/tr/td[2]/a[1]     - title pattern
    '''
    titles = driver.find_elements_by_xpath('//*[@id="old_content"]/table/tbody/tr/td[2]/a[1]')
    for title in titles:
        title_txt.append(title.text)
    print(title_txt)

    # 2) rating
    '''
    //*[@id="old_content"]/table/tbody/tr[1]/td[2]/div/em
    //*[@id="old_content"]/table/tbody/tr[2]/td[2]/div/em
    //*[@id="old_content"]/table/tbody/tr/td[2]/div/em - rating pattern
    '''
    stars = driver.find_elements_by_xpath('//*[@id="old_content"]/table/tbody/tr/td[2]/div/em')
    for star in stars:
        star_txt.append(star.text)
    print(star_txt)

    # 3) review
    '''
    //*[@id="old_content"]/table/tbody/tr[1]/td[2]
    //*[@id="old_content"]/table/tbody/tr[2]/td[2]
    //*[@id="old_content"]/table/tbody/tr[3]/td[2]
    //*[@id="old_content"]/table/tbody/tr/td[2] - review pattern
    '''
    conts = driver.find_elements_by_xpath('//*[@id="old_content"]/table/tbody/tr/td[2]')
    for cont in conts:
        # cont.text holds the text of every child element, one per line, e.g.:
        #   Spider-Man: No Way Home : title  [0]
        #   Rating - out of 10      : label  [1]
        #   10                      : rating [2]
        #   <review text> ์‹ ๊ณ        : review [3]
        txt_token = str(cont.text).split('\n')
        review = txt_token[3]
        cont_txt.append(review[:-3])  # drop the trailing ' ์‹ ๊ณ ' (report) link text
    print(cont_txt)

print('number of titles  :', len(title_txt))  # 10 -> 200
print('number of ratings :', len(star_txt))   # 10 -> 200
print('number of reviews :', len(cont_txt))   # 10 -> 200

driver.close()  # close the window




5. file save
1) DataFrame

import pandas as pd

df = pd.DataFrame({'title': title_txt, 'star': star_txt, 'review': cont_txt},
                  columns=['title', 'star', 'review'])


2) csv file save

df.to_csv('movie_review.csv', index=False)
print('file saved...')


3) csv file read

movie_review = pd.read_csv('movie_review.csv')
movie_review.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   title   200 non-null    object
 1   star    200 non-null    int64
 2   review  189 non-null    object

movie_review.head()
movie_review.tail()
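The info() output shows only 189 of the 200 reviews are non-null, i.e. some rows have a missing review. One way to drop them before any text analysis, sketched on toy data (the column names follow the DataFrame above):

```python
# Drop rows whose 'review' value is missing, keeping title/star/review intact.
import pandas as pd

df = pd.DataFrame({
    'title':  ['A', 'B', 'C'],
    'star':   [10, 8, 9],
    'review': ['good', None, 'great'],  # one missing review, as in the real data
})

clean = df.dropna(subset=['review']).reset_index(drop=True)
print(len(clean))  # 2
```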





celeb image crawling

์…€๋Ÿฝ ์ด๋ฏธ์ง€ ์ˆ˜์ง‘
Selenium + BeautifulSoup

(base) conda activate tensorflow
(tensorflow) pip install beautifulsoup4

from selenium import webdriver          # dynamic page control
from bs4 import BeautifulSoup           # static page processing
from urllib.request import urlretrieve  # server image -> local file save
import numpy as np                      # remove duplicate image urls
import os                               # folder management (path, create, move)

def celeb_img_crawler(name):
    # 1. create the driver object
    path = r"C:\ITWILL\5_Tensorflow\workspace"
    driver = webdriver.Chrome(path + '/chromedriver.exe')

    # 2. move to the image-search url
    driver.get('https://www.google.co.kr/imghp?hl=ko&tab=ri&ogbl')

    # 3. search input box : get it by the name attribute
    search_box = driver.find_element_by_name("q")
    search_box.send_keys(name)  # enter the keyword

    # 4. click the [search] button : copy xpath : //*[@id="sbtc"]/button
    search_btn = driver.find_element_by_xpath('//*[@id="sbtc"]/button')
    search_btn.click()         # click the button
    driver.implicitly_wait(3)  # wait up to 3 seconds (resource loading)

    # 5. collect the div tags containing the images -> collect the img tags
    image_url = []
    for i in range(50):  # 0 ~ 49
        # 1) div tag containing the image
        src = driver.page_source                  # source of the current page
        html = BeautifulSoup(src, 'html.parser')  # parse the html
        div_img = html.select_one(f'div[data-ri="{i}"]')  # 'tag[attr="value"]'
        try:
            # 2) img tag -> extract the src attribute value -> store in the list
            # (inside try: div_img may be None when fewer results are shown)
            img_tag = div_img.select_one('img[class="rg_i Q4LuWd"]')  # img element
            image_url.append(img_tag.attrs['src'])
            print(str(i + 1) + 'th image url extracted')
        except:
            print(str(i + 1) + 'th image url missing')

    # 6. remove duplicate image urls
    print(len(image_url))            # 28
    image_url = np.unique(image_url)
    print(len(image_url))            # 28

    # 7. create the image folder (dir) and move into it
    pwd = os.getcwd()  # e.g. C:\ITWILL\5_Tensorflow\workspace\chap06_Selenium_Crawling\lecture
    os.mkdir(name)              # create a folder (celebrity name) at the current location
    os.chdir(pwd + '/' + name)  # move into the folder

    # 8. image_url -> file save
    for i in range(len(image_url)):  # 0 ~ 27
        try:
            file_name = "test" + str(i + 1) + ".jpg"  # test1.jpg ~ test50.jpg
            urlretrieve(image_url[i], file_name)      # file save
            print(str(i + 1) + 'th image saved')
        except:
            print('no image at this url :', image_url[i])

    os.chdir(pwd)   # return to the starting location (for the next celebrity)
    driver.close()  # close the window


test call
ex) celeb_img_crawler("์ฐจ์ธํ‘œ")

collecting images for several celebrities

nameList = ["์‹ฌ์ž์œค", "์†กํ˜œ๊ต", "๊ฐ•๋™์›"]  # 48, 36, 32 images each
for name in nameList:
    celeb_img_crawler(name)  # called 3 times
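An aside on step 6 of the function above: np.unique() removes duplicates but also sorts the urls. If the original order should be preserved, plain Python's dict.fromkeys() does the same deduplication in insertion order:

```python
# Order-preserving deduplication: dict keys keep insertion order,
# so converting back to a list drops repeats without sorting.
urls = ['b.jpg', 'a.jpg', 'b.jpg', 'c.jpg']

unique_urls = list(dict.fromkeys(urls))
print(unique_urls)  # ['b.jpg', 'a.jpg', 'c.jpg']
```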





celeb image crawling with scrolling

์…€๋Ÿฝ ์ด๋ฏธ์ง€ ์ˆ˜์ง‘
Selenium + BeautifulSoup

(base) conda activate tensorflow
(tensorflow) pip install beautifulsoup4


from selenium import webdriver          # dynamic page control
from bs4 import BeautifulSoup           # static page processing
from urllib.request import urlretrieve  # server image -> local file save
import numpy as np                      # remove duplicate image urls
import os                               # folder management (path, create, move)
import time                             # pauses inside the scroll loop

def celeb_img_crawler(name):
    # 1. create the driver object
    path = r"C:\ITWILL\5_Tensorflow\workspace"
    driver = webdriver.Chrome(path + '/chromedriver.exe')

    # 2. move to the image-search url
    driver.get('https://www.google.co.kr/imghp?hl=ko&tab=ri&ogbl')

    # 3. search input box : get it by the name attribute
    search_box = driver.find_element_by_name("q")
    search_box.send_keys(name)  # enter the keyword

    # 4. click the [search] button : copy xpath : //*[@id="sbtc"]/button
    search_btn = driver.find_element_by_xpath('//*[@id="sbtc"]/button')
    search_btn.click()         # click the button
    driver.implicitly_wait(3)  # wait up to 3 seconds (resource loading)

    # ------------ scroll down the page ------------------------------------
    last_height = driver.execute_script("return document.body.scrollHeight")  # current scroll height
    while True:  # repeat until the bottom stops moving
        # scroll to the bottom of the browser window
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # wait 2 seconds for the new content to load

        # recompute the scroll height of the refreshed page
        new_height = driver.execute_script("return document.body.scrollHeight")

        # stop when the height no longer changes
        if new_height == last_height:
            try:
                # the [Show more results] button may not exist - handle the exception
                driver.find_element_by_class_name("mye4qd").click()  # find by class name
                # driver.find_element_by_css_selector(".mye4qd").click()  # selector (.class)
            except:
                break
        last_height = new_height  # replace with the newly computed height
    # ----------------------------------------------------------------------

    # 5. collect the div tags containing the images -> collect the img tags
    image_url = []
    for i in range(50):  # 0 ~ 49
        # 1) div tag containing the image
        src = driver.page_source                  # source of the current page
        html = BeautifulSoup(src, 'html.parser')  # parse the html
        div_img = html.select_one(f'div[data-ri="{i}"]')  # 'tag[attr="value"]'
        try:
            # 2) img tag -> extract the src attribute value -> store in the list
            # (inside try: div_img may be None when fewer results are shown)
            img_tag = div_img.select_one('img[class="rg_i Q4LuWd"]')  # img element
            image_url.append(img_tag.attrs['src'])
            print(str(i + 1) + 'th image url extracted')
        except:
            print(str(i + 1) + 'th image url missing')

    # 6. remove duplicate image urls
    print(len(image_url))            # 28
    image_url = np.unique(image_url)
    print(len(image_url))            # 28

    # 7. create the image folder (dir) and move into it
    pwd = os.getcwd()  # e.g. C:\ITWILL\5_Tensorflow\workspace\chap06_Selenium_Crawling\lecture
    os.mkdir(name)              # create a folder (celebrity name) at the current location
    os.chdir(pwd + '/' + name)  # move into the folder

    # 8. image_url -> file save
    for i in range(len(image_url)):  # 0 ~ 27
        try:
            file_name = "test" + str(i + 1) + ".jpg"  # test1.jpg ~ test50.jpg
            urlretrieve(image_url[i], file_name)      # file save
            print(str(i + 1) + 'th image saved')
        except:
            print('no image at this url :', image_url[i])

    os.chdir(pwd)   # return to the starting location (for the next celebrity)
    driver.close()  # close the window


test call
ex) celeb_img_crawler("์ฐจ์ธํ‘œ")

collecting images for several celebrities

nameList = ["์‹ฌ์ž์œค", "์†กํ˜œ๊ต", "๊ฐ•๋™์›"]  # 48, 48, 48 images each
for name in nameList:
    celeb_img_crawler(name)  # called 3 times
