Building a basic web scraper with Python and Selenium, part 3

5.14 Select

Running the code from the previous post opens a Chrome window that loads Indeed with python already searched.

Right-click on the page and choose Inspect.

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        print('job li')

    

while(True):
    pass

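In the inspector you can see an anchor inside an h2 whose class is jobTitle.

The key is to pick the class whose name most closely matches the data you want to extract.

jobTitle is used because that class name is very unlikely to appear anywhere else on the page.

A class name such as css1m4cuuf, by contrast, describes nothing and may well be used elsewhere on the page, so selecting by it is fragile.

The point is to find a class name like jobTitle that clearly identifies the job information, and extract the data through it.

( Prefer the most distinctive class name available; it also describes the data being pulled. )

Next, get the h2 with the jobTitle class.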

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        h2 = job.find('h2',class_='jobTitle')
        a = h2.find('a')

    
while(True):
    pass

Once we have the h2, we want the a inside it.

h2 = job.find('h2',class_='jobTitle') → finds the h2 by the class name jobTitle and stores it in the variable h2.

With jobTitle found, the next step is the anchor,

because the anchor holds the link to the job posting.

That link will be saved to an Excel file later, and the anchor also carries a label for the link.

( The label contains the job title, so there is no need to go down into the span. )

So we go into the h2 and get the anchor. The label is on the anchor,

and if we can read the label, we can read the href ( the link ) as well.

a = h2.find('a') → finds the a tag inside h2 and stores it in the variable a.

Besides find and find_all, BeautifulSoup offers several other methods.

Instead of find, this time we use select ( to save some code ).

select accepts a CSS selector,

so it gives us another way to search for elements.

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        anchor = job.select('h2 a')
        print(anchor)
        print('//////////////////')
    
while(True):
    pass

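Output:

We get a list containing the a needed for each job posting.

anchor = job.select('h2 a') → select the h2, go inside it, get the a, and store the result in the variable anchor.

< Why select was used >

h2 = job.find('h2',class_='jobTitle')
a = h2.find('a')

With select, the two lines above shrink to the single line below.

anchor = job.select('h2 a')

( 'h2 a' matches an a under any h2; the selector 'h2.jobTitle a' would keep the class restriction. )

Since I am looking for a single element, I change job.select to job.select_one.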

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        anchor = job.select_one('h2 a')
        print(anchor)
        print('//////////////////')
    
while(True):
    pass

Output:

This returns the single anchor itself rather than a list holding one anchor.
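The difference between find, select, and select_one is easier to see on a small fixed snippet than on the live page. The sketch below assumes a simplified, hypothetical piece of markup standing in for one job li; the real Indeed markup is larger and may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup standing in for a single job <li>.
html = """
<li>
  <h2 class="jobTitle">
    <a href="/rc/clk?jk=example" aria-label="Python Developer">
      <span>Python Developer</span>
    </a>
  </h2>
</li>
"""
job = BeautifulSoup(html, "html.parser")

# find chain: two steps, first the h2, then the a inside it.
h2 = job.find("h2", class_="jobTitle")
a_from_find = h2.find("a")

# select: always returns a list, even when only one element matches.
anchors = job.select("h2 a")

# select_one: returns the first matching element itself,
# or None when nothing matches the selector.
anchor = job.select_one("h2.jobTitle a")
missing = job.select_one("h3 a")

print(len(anchors))
print(anchor["href"])
print(missing)
```

Because select_one can return None, real scraping code usually checks the result before indexing into it.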

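Now that the code can reach the anchor,

we extract href, the link that leads to the job posting,

and the attribute called aria-label.

aria-label is an attribute that can be set on an HTML element;

it holds the text that a screen reader should read out.

Screen reader = a program that reads the contents of a web page aloud

( e.g. to read the page to people with visual impairments )

When an element has no visible text that a screen reader could read properly,

aria-label is the attribute that supplies that text.

We use it to extract the job information we need.

Thanks to how BeautifulSoup works, the anchor printed by the code is not an HTML string but a Tag object whose attributes can be read like dictionary keys.

As mentioned in the previous post, BeautifulSoup converts the HTML tags it finds

into list-like and dictionary-like data structures.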

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        anchor = job.select_one('h2 a')
        title = anchor['aria-label']
        link = anchor['href']
        print(title, link)
        print('////////\n////////')
    
while(True):
    pass

Output:

title = anchor['aria-label']

→ reads the value ( 라인프렌즈 2022년 ~ ) stored under the key aria-label of anchor and saves it in the variable title.

link = anchor['href']

→ reads the value ( /rc/clk?jk=1015~ ) stored under the key href of anchor and saves it in the variable link.

The link is only a relative path, so later an f-string prepends the site address to build the full URL.

( /rc/clk?jk=1015~ → https://kr.indeed.com{link} )
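As an alternative to gluing the strings together by hand, the standard library's urljoin can resolve the relative href against the site root. This is only a sketch: the helper name absolute_url and the jk=example value are made up for illustration.

```python
from urllib.parse import urljoin

BASE = "https://kr.indeed.com"


def absolute_url(link: str, base: str = BASE) -> str:
    """Resolve a possibly relative href against the site root."""
    return urljoin(base, link)


link = "/rc/clk?jk=example"  # hypothetical relative href

# Adding an extra slash by hand produces a double slash in the path.
naive = f"{BASE}/{link}"

print(naive)
print(absolute_url(link))
print(absolute_url("https://example.com/job/1"))  # already absolute: unchanged
```

urljoin also leaves an already absolute URL untouched, so the same helper works if some anchors carry full links.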

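This works because the anchor returned by BeautifulSoup is not an HTML string but a Tag object that supports dictionary-style access to its attributes.

As noted above, BeautifulSoup turns the HTML tags it finds into list-like and dictionary-like structures,

which is why the code above can extract the values.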
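Strictly speaking, the object is a bs4 Tag rather than a plain dict, but it can be indexed by attribute name in the same way. A minimal sketch, assuming a hypothetical one-line anchor in place of the real page:

```python
from bs4 import BeautifulSoup

# Hypothetical anchor markup; the real attributes on Indeed may differ.
html = '<a href="/rc/clk?jk=example" aria-label="Python Developer">Python Developer</a>'
anchor = BeautifulSoup(html, "html.parser").a

# The object is a Tag, not a str and not a dict.
print(type(anchor).__name__)

# Bracket access works like a dict lookup and raises KeyError
# when the attribute is missing.
title = anchor["aria-label"]

# .get() returns None instead of raising for a missing attribute.
link = anchor.get("href")
target = anchor.get("target")

# The underlying attributes are stored in a real dict.
print(anchor.attrs)
print(title, link, target)
```

Using .get() is the safer choice when an attribute may be absent on some postings.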

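Next, extract and print the region or city where each job is located.

In the inspector, collapse h2 jobTitle and expand the div with class company_location, companyInfo.

As in the earlier weworkremotely scraper, link, company, location (region), and position become the keys

of a dictionary called job_data. An empty list called results is created before the for loop,

and each job_data is appended to results as it is extracted, then printed at the end.

( See previous posts 1 and 2. )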

from requests import get
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_experimental_option('excludeSwitches', ['enable-logging'])
browser = webdriver.Chrome(options=options)

base_url = 'https://kr.indeed.com/jobs?q='
search_term = 'python'

browser.get(f'{base_url}{search_term}')
results = []
soup = BeautifulSoup(browser.page_source,'html.parser')
job_list = soup.find('ul', class_='jobsearch-ResultsList')
jobs = job_list.find_all('li', recursive=False)
for job in jobs:
    zone = job.find('div', class_='mosaic-zone')
    if zone is None:
        anchor = job.select_one('h2 a')
        title = anchor['aria-label']
        link = anchor['href']
        company = job.find('span',class_='companyName')
        location = job.find('div',class_='companyLocation')
        job_data = {
            'link':f'https://kr.indeed.com{link}',
            'company': company.string,
            'location': location.string,
            'position': title
        }
        results.append(job_data)
for result in results:
    print(result,'\n////////\n')
        
while(True):
    pass

Output:

< Possible error >

If an error appears at this point, it is most likely because the code says title.string. Python raises an AttributeError there, since a str has no .string attribute.

title was assigned with title = anchor['aria-label'], so it is already plain text and .string cannot be used on it.

( .string returns the text inside a tag. )

For example, given the tag <span class='title'>(Senior) Python Full Stack Software Developer</span>,

.string returns only the text (Senior) Python Full Stack Software Developer.

Changing title.string to title removes the error.

job_data = {
            'link':f'https://kr.indeed.com{link}',
            'company': company.string,
            'location': location.string,
            'position': title
           }
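The error comes from treating a str as if it were a tag. One way to avoid it is a small helper that accepts either kind of value. The sketch below is only an illustration: to_text is a hypothetical helper, and FakeTag is a stand-in for a BeautifulSoup tag so the example runs without a browser or a page.

```python
def to_text(value):
    """Return plain text from a str, a tag-like object, or None.

    'Tag-like' means any object with a get_text() method,
    which BeautifulSoup tags provide.
    """
    if value is None:
        return None
    if isinstance(value, str):
        return value.strip()
    return value.get_text(strip=True)


class FakeTag:
    """Stand-in for a BeautifulSoup tag (hypothetical, for this sketch only)."""

    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text


title = "Python Developer"       # what anchor['aria-label'] yields: a str
location = FakeTag("  Seoul  ")  # what job.find(...) yields: a tag-like object

# A plain str has no .string attribute, which is why title.string fails.
print(hasattr(title, "string"))
print(to_text(title), to_text(location), to_text(None))
```

With a helper like this, job_data can be built the same way whether a field came from an attribute or from a tag, and a missing element yields None instead of an exception.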

Reference course for the Python web scraper

https://nomadcoders.co/python-for-beginners/lobby

Reference course for Selenium

https://nomadcoders.co/selenium-for-beginners
