[NLP] Text Preprocessing

Tokenization

nltk: Natural Language Toolkit

Supports processing of English text; preinstalled on Colab.

pip list

Installing konlpy

pip install konlpy

Using konlpy

import konlpy

konlpy.__version__

from konlpy.tag import Okt
okt = Okt()
print(okt.morphs(u'단독입찰보다 복수입찰의 경우'))
print(okt.morphs(u'아버지가방에들어가신다'))

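nltk: tokenization tools for English corpora

from nltk.tokenize import word_tokenize, WordPunctTokenizer, sent_tokenize
#word_tokenize: function that tokenizes text into words
#WordPunctTokenizer: class that tokenizes text into words and splits punctuation into separate tokens
#sent_tokenize: function that tokenizes text into sentences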

import nltk
nltk.download('punkt') #tokenizer models; newer NLTK releases may ask for 'punkt_tab' instead

# Does not simply split on whitespace; contractions with apostrophes are handled as well
#The apostrophe stays attached to what follows it (e.g. Tommy's -> Tommy, 's)
print("Word tokenization result: ", word_tokenize("Tommy's And that's exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in."))

# WordPunctTokenizer is a class, so create an object first
#and then call its tokenize method.
#The apostrophe is split off as a separate token.
print("Word tokenization result: ", WordPunctTokenizer().tokenize("Tommy's And that's exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in."))
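To see why WordPunctTokenizer separates the apostrophe, here is a minimal standard-library sketch. It assumes the pattern `\w+|[^\w\s]+`, which is the regular expression that tokenizer is built on; `simple_wordpunct` is a hypothetical helper name, not part of nltk.

```python
import re

def simple_wordpunct(text):
    # Runs of word characters and runs of punctuation become separate tokens.
    return re.findall(r"\w+|[^\w\s]+", text)

tokens = simple_wordpunct("Tommy's machines can't wait.")
print(tokens)
# ['Tommy', "'", 's', 'machines', 'can', "'", 't', 'wait', '.']
```

word_tokenize, by contrast, would keep 's attached as a single token, which is usually easier to work with for contractions.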

#Sentence-level tokenization
#Splits the text at sentence boundaries such as periods and exclamation marks
data = "Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge."
print("Sentence tokenization result: ", sent_tokenize(data)) #prints the split sentences as a list

from nltk.tag import pos_tag
#pos_tag: tags each token with its part of speech

import nltk
nltk.download('averaged_perceptron_tagger') #tagger model; newer NLTK releases may ask for 'averaged_perceptron_tagger_eng' instead

#Part of speech of each word
#POS tag reference: https://happygrammer.github.io/nlp/postag-set/
res = word_tokenize("Tommy's And that's exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.")
print("Word tokenization result: ", res)
print("Word tokenization -> POS tags")
pos_tag(res)

kss

pip install kss

import kss
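#kss (Korean Sentence Splitter): splits Korean text into sentences, i.e. a sentence-level tokenizer
#Tokenization is usually done at the word level.
#Sentence-level tokenization is used when analyzing relations between sentences or extracting key sentences.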

text = "여름입니다. 날씨가 덥습니다! 딥러닝을 공부합니다. 네?"

kss.split_sentences(text)
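For comparison, a naive splitter can be sketched with the standard library alone. This is not how kss works internally; it only splits on whitespace after '.', '!' or '?', and `naive_split_sentences` is a hypothetical helper. It would break on abbreviations, quotes, or text without spaces, which is what a dedicated tool such as kss is for.

```python
import re

def naive_split_sentences(text):
    # Split on whitespace that directly follows '.', '!' or '?'.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

sentences = naive_split_sentences("여름입니다. 날씨가 덥습니다! 딥러닝을 공부합니다. 네?")
print(sentences)
# ['여름입니다.', '날씨가 덥습니다!', '딥러닝을 공부합니다.', '네?']
```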

Using Okt and Kkma

from konlpy.tag import Okt, Kkma

okt = Okt()
kkma = Kkma()

#Each analyzer can return only the part you ask for (morphemes, POS tags, nouns); see the Korean POS tags comparison chart for the tag sets
#okt
#Morphological analyzers are not perfectly accurate.
print("Okt : ", okt.morphs("NLP를 열심히 공부하고, 취업에 성공합시다")) #split into morphemes
print("Okt : ", okt.pos("NLP를 열심히 공부하고, 취업에 성공합시다")) #morphemes with their POS tags
print("Okt : ", okt.nouns("NLP를 열심히 공부하고, 취업에 성공합시다")) #nouns only

#kkma
print("kkma : ", kkma.morphs("NLP를 열심히 공부하고, 취업에 성공합시다")) #split into morphemes
print("kkma : ", kkma.pos("NLP를 열심히 공부하고, 취업에 성공합시다")) #morphemes with their POS tags
print("kkma : ", kkma.nouns("NLP를 열심히 공부하고, 취업에 성공합시다")) #nouns only

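Words that occur very rarely or are very short contribute little to training, so consider removing them depending on the task.

Regular expressions

import re #regular expression package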
text="Tommy's Don't And that’s exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in."

pat = re.compile(r'\W*\b\w{1,2}\b')

pat.sub('',text)
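# Words of one or two characters (In, to, a, ...) are removed together with any non-word characters directly before them, so 's, 't and the period before In disappear as well.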
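The effect of this pattern is easier to see on short inputs. The sketch below reuses the same regular expression on two hypothetical sentences chosen for illustration.

```python
import re

# Optional non-word characters, then a whole word of 1-2 characters.
short_word = re.compile(r"\W*\b\w{1,2}\b")

r1 = short_word.sub("", "I am a big fan of NLP")
r2 = short_word.sub("", "Don't go to it")

print(repr(r1))  # ' big fan NLP'
print(repr(r2))  # 'Don'
```

Note that the contraction "Don't" is reduced to "Don": the apostrophe is a non-word character, so "'t" matches as a one-character word with its preceding punctuation.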

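Stopword removal

import nltk
nltk.download('stopwords') #the stopword corpus must be downloaded once
from nltk.corpus import stopwords
#stopwords: corpus reader that holds the stopword lists

#a list of words that carry little meaning for analysis
stopwords.words("english")

sw = set(stopwords.words("english")) #convert to a set: removes duplicates and makes membership tests fast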

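#remove stopwords from the input sentence

#step 1: word-level tokenization
wt = word_tokenize(text)
res=[]

#step 2: remove the stopwords
for w in wt: #each token in the list
  if w.lower() not in sw: #keep the token if it is not a stopword (the stopword list is lowercase, so compare in lowercase)
    res.append(w)

print(wt) #before stopword removal
print(len(wt))

print(res) #after stopword removal
print(len(res))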
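The same filtering step can be written as a list comprehension. The following is a small sketch that runs without nltk; `stop_set` is a hypothetical, tiny stopword set used only for illustration, and the tokens come from a plain whitespace split.

```python
# Hypothetical, tiny stopword set (nltk's English list is much longer).
stop_set = {"the", "is", "a", "of", "to"}

tokens = "Tokenization is a key aspect of working with text".split()

# Lowercase each token before the lookup because the stopword set is lowercase.
filtered = [t for t in tokens if t.lower() not in stop_set]

print(filtered)
# ['Tokenization', 'key', 'aspect', 'working', 'with', 'text']
```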

Stopword file: https://gist.github.com/sebleier/554280

#removing stopwords with a user-defined list
text="NLP를 열심히 공부하고, 취업에 성공합시다"
sw="를 에 고 라고 다"

sw = sw.split(" ")
sw

wt = okt.morphs(text)
res = [w for w in wt if w not in sw] #keep each morpheme w from wt only if it is not in sw, i.e. not a stopword

wt #tokens (morphemes)

res #after stopword removal

Korean stopword list: https://deep.chulgil.me/hangugeo-bulyongeo-riseuteu/

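Encoding

One-hot encoding: an encoding scheme in which the "hot" position is 1 and every other position is 0
word -> integer (each word is represented by an index)
good hi hello -> 0 1 2
=> 3 distinct words: 100 010 001 (only the target position is 1; the rest are 0)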
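The three-word example above can be reproduced in a few lines of plain Python. This is a minimal sketch; `one_hot` is a hypothetical helper name.

```python
words = ["good", "hi", "hello"]

# Assign each distinct word an integer index: good -> 0, hi -> 1, hello -> 2.
index = {w: i for i, w in enumerate(words)}

def one_hot(word):
    # Vector of zeros with a single 1 at the word's index.
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec

for w in words:
    print(w, one_hot(w))
# good [1, 0, 0]
# hi [0, 1, 0]
# hello [0, 0, 1]
```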

text = """Tokenization is a key (and mandatory) aspect of working with text data
We’ll discuss the various nuances of tokenization, including how to handle Out-of-Vocabulary words (OOV)
Language is a thing of beauty. But mastering a new language from scratch is quite a daunting prospect. If you’ve ever picked up a language that wasn’t your mother tongue, you’ll relate to this! There are so many layers to peel off and syntaxes to consider – it’s quite a challenge.
And that’s exactly the way with our machines. In order to get our computer to understand any text, we need to break that word down in a way that our machine can understand. That’s where the concept of tokenization in Natural Language Processing (NLP) comes in.
Simply put, we can’t work with text data if we don’t perform tokenization. Yes, it’s really that important!
"""

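sents = sent_tokenize(text) #sentence-level tokenization

sw = set(stopwords.words("english")) #English stopwords; they are removed word by word in the loop below

sents #the text split into sentences
len(sents)

vocab = {} #{} creates an empty dict (an empty set would be set()); it will map word -> frequency
pre_sents = []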

for s in sents:
  wt = word_tokenize(s) #tokenize each sentence into words
  res = []

  #start processing one sentence
  for w in wt:
    #note: English is case-sensitive, so lowercase every token before comparing or counting
    w = w.lower()

    if w not in sw: #not a stopword

      if len(w) > 2: #longer than 2 characters
        res.append(w)

        if w not in vocab:
          vocab[w] = 0 #first occurrence: add key w with an initial count of 0

        vocab[w] += 1 #count every kept word (not a stopword and longer than 2 characters)
  #end of one sentence

  pre_sents.append(res) #append the kept words of this sentence as one list

print(pre_sents)

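vocab['language'] #frequency of the word 'language' ('and' is a stopword, so it never reaches vocab)

#sort by frequency in descending order
vs = sorted(vocab.items(), key = lambda x:x[1], reverse=True)
            #arguments: the data to sort, the sort key, descending (True) or ascending (False)
                            #lambda x:x[1]: x is a (word, frequency) pair; index 0 is the key, index 1 the value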

#assign an index to every word whose frequency is 2 or more
word_index={}
a = 0

for w, f in vs:
  if f >=2:
    a += 1
    word_index[w] = a

word_index #all subsequent encoding is based on the words collected here
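The counting, sorting, and indexing steps can also be expressed with `collections.Counter`. The sketch below is an alternative under stated assumptions, not the code used in this post; `sample_sents` is hypothetical data standing in for already preprocessed sentences.

```python
from collections import Counter

# Hypothetical preprocessed sentences (lowercased, stopwords already removed).
sample_sents = [
    ["language", "beauty"],
    ["language", "tokenization"],
    ["tokenization", "text", "language"],
]

# Count every word across all sentences.
counts = Counter(w for sent in sample_sents for w in sent)

# most_common() yields (word, count) pairs sorted by descending count.
sample_index = {}
for w, f in counts.most_common():
    if f >= 2:
        sample_index[w] = len(sample_index) + 1

print(sample_index)
# {'language': 1, 'tokenization': 2}
```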

1. Building the encoding by hand

word_index['OOV'] = len(word_index)+1 # OOV (out of vocabulary): words that have no entry in word_index are mapped to the index len(word_index)+1

enc_sents = []
for s in pre_sents: #each s is one sentence, i.e. a list of kept words
  enc_sent = []
  for w in s: #each w is one word of that sentence
    try:
      enc_sent.append(word_index[w]) #raises KeyError if w has no entry in word_index
    except KeyError:
      enc_sent.append(word_index['OOV']) #fall back to the OOV index for unknown words
  enc_sents.append(enc_sent)

print(enc_sents)
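The try/except lookup can be shortened with `dict.get`, which returns a fallback value instead of raising KeyError. The following is a minimal sketch; `demo_index` and `encode` are hypothetical names used only for this illustration.

```python
# Hypothetical index; the last entry is reserved for out-of-vocabulary words.
demo_index = {"language": 1, "tokenization": 2, "OOV": 3}

def encode(sentence, index):
    oov = index["OOV"]
    # dict.get returns the OOV index when the word has no entry.
    return [index.get(w, oov) for w in sentence]

encoded_demo = encode(["language", "beauty", "tokenization"], demo_index)
print(encoded_demo)
# [1, 3, 2]
```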

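Data collection (crawling) tools: selenium, beautifulsoup4, etc.

Visualization tools: matplotlib, plotly, tableau, seaborn, folium

Data analysis tools: numpy, pandas

Machine learning tools: scikit-learn

Deep learning frameworks: TensorFlow (Keras), PyTorch, Caffe, etc.

2. Encoding with Keras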

from tensorflow.keras.preprocessing.text import Tokenizer

#use the nested-list data (one token list per sentence) as input
pre_sents

tok = Tokenizer()
tok.fit_on_texts(pre_sents) #assign an index to every word

#mapping from each word to its integer index
tok.word_index

#frequency of each word
tok.word_counts

tok.texts_to_sequences(pre_sents) #convert the texts to integer sequences

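Padding

Models that take a batch of sentences as a single array need every sentence to have the same length, so shorter sentences are padded.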

pre_sentences = [['driver', 'person'], ['driver', 'good', 'person'], ['driver', 'huge', 'person'], ['knew', 'bad'], ['bad', 'kept', 'huge', 'bad'], ['huge', 'bad']]

1. Padding by hand

#integer encoding: assign an integer to each word
tok = Tokenizer()
tok.fit_on_texts(pre_sentences) #assign an index to every word

tok.texts_to_sequences(pre_sentences) #convert the texts to integer sequences

#padding: make every sequence the same length
#the padding token (PAD) is represented by the value 0
encoded = tok.texts_to_sequences(pre_sentences)

encoded

maxlen = max(len(i) for i in encoded) #length of the longest encoded sequence

#pad every list up to the maximum length

for s in encoded:
  while len(s) < maxlen:
    s.append(0)

encoded #check that the padding was applied

import numpy as np

np.array(encoded) #convert to a NumPy array
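The loop above modifies the lists in place. A variant that returns new lists and leaves the input untouched could look like the sketch below; `pad_post` and the `demo` data are hypothetical names and values for illustration.

```python
def pad_post(seqs, value=0):
    # Target length is the length of the longest sequence.
    maxlen = max(len(s) for s in seqs)
    # Build new lists; the fill values go at the end of each sequence.
    return [s + [value] * (maxlen - len(s)) for s in seqs]

demo = [[1, 5], [1, 8, 5], [2, 6, 3, 2]]
padded_demo = pad_post(demo)

print(padded_demo)
# [[1, 5, 0, 0], [1, 8, 5, 0], [2, 6, 3, 2]]
```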

2. Padding with Keras

from tensorflow.keras.preprocessing.sequence import pad_sequences

encoded = tok.texts_to_sequences(pre_sentences) #integer encoding
encoded

padded = pad_sequences(encoded, padding='post', maxlen=10) #padding='post' adds zeros at the end (the default is 'pre'); maxlen sets the target length
padded

padded4 = pad_sequences(encoded, maxlen=4)
padded4

padded2post = pad_sequences(encoded, maxlen=2, truncating='post') #truncating='post': cut from the end, keeping the first maxlen values
                                                              #maxlen=2: sequences longer than this are truncated
padded2post

padded2pre = pad_sequences(encoded, maxlen=2, truncating='pre') #truncating='pre' (the default): cut from the front, keeping the last maxlen values
                                                              #maxlen=2: sequences longer than this are truncated
padded2pre
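The interplay of the `padding` and `truncating` options is easier to follow on a single sequence. The function below is a plain-Python sketch that mimics those two options for one list; it is not the Keras implementation, and `pad_or_truncate` is a hypothetical name.

```python
def pad_or_truncate(seq, maxlen, padding="pre", truncating="pre", value=0):
    # Truncate first: 'pre' drops items from the front, 'post' from the end.
    if len(seq) > maxlen:
        seq = seq[-maxlen:] if truncating == "pre" else seq[:maxlen]
    # Then pad: 'pre' puts the fill values in front, 'post' at the end.
    fill = [value] * (maxlen - len(seq))
    return fill + seq if padding == "pre" else seq + fill

print(pad_or_truncate([1, 2, 3, 4], 2, truncating="pre"))   # [3, 4]
print(pad_or_truncate([1, 2, 3, 4], 2, truncating="post"))  # [1, 2]
print(pad_or_truncate([1, 2], 4, padding="pre"))            # [0, 0, 1, 2]
print(pad_or_truncate([1, 2], 4, padding="post"))           # [1, 2, 0, 0]
```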

One-hot encoding

tokens = okt.morphs("자연어 처리 공부를 합니다.")
tokens

text = "점심 메뉴로 소고기볶음밥 먹었습니다. 소고기볶음밥 너무 맛있어요. 또 먹을래요."

tok = Tokenizer()
tok.fit_on_texts([text])

tok.word_index #integer assigned to each word

test = "내일 메뉴로 소고기볶음밥 또 나왔으면 좋겠다."

tok.texts_to_sequences([test])
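#encode the test sentence using the integers assigned while fitting on text
#the new sentence is looked up in tok (the vocabulary built from the corpus); words not seen during fitting are dropped because no oov_token was set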

encoded = tok.texts_to_sequences([test])[0]
encoded

from tensorflow.keras.utils import to_categorical

o_v = to_categorical(encoded) #one-hot encoding; the vector length is max(encoded)+1 unless num_classes is given
o_v
