[NLP] LSTM

Converting text to a numeric matrix with texts_to_matrix

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Embedding, SimpleRNN, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.preprocessing.sequence import pad_sequences
import pandas as pd
import numpy as np
from string import punctuation

texts = ['자연어 처리 알고리즘',
         '자연어 처리 방법',
         '자연어 NLP 알고리즘 알고리즘',
         '자연어 처리 전문가']

# Create an object that handles tokenization
tok = Tokenizer() # the class is the blueprint, the object is the building
tok.fit_on_texts(texts)

tok.index_word # the integer assigned to each word

tok.texts_to_matrix(texts)
# Converts the input texts into a matrix, one row per text
# Words that occur in a sentence are marked with 1
# The default is mode = 'binary'

tok.texts_to_matrix(texts, mode = 'binary')

tok.texts_to_matrix(texts, mode = 'count') # outputs how many times each word occurs

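tf: how often a word occurs within a document (term frequency)
idf: based on how many documents the word appears in; the fewer the documents, the larger the value (inverse document frequency)
tfidf: high for a word that occurs often in one document but appears in only a few documents overall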
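The definitions above can be checked by hand on the four example texts. The following is a minimal sketch using the textbook formula (raw count times log(N / df)); the names `docs`, `doc_freq`, and `tfidf` are made up for illustration. Keras' mode = 'tfidf' uses its own smoothed variant, so its numbers will not match these exactly.

```python
import math
from collections import Counter

# The four example texts from above, split on whitespace and lowercased.
docs = [
    "자연어 처리 알고리즘".split(),
    "자연어 처리 방법".split(),
    "자연어 nlp 알고리즘 알고리즘".split(),
    "자연어 처리 전문가".split(),
]

n_docs = len(docs)
# Document frequency: the number of documents each word appears in.
doc_freq = Counter(word for doc in docs for word in set(doc))

def tfidf(doc):
    """Textbook tf-idf for one document: raw count * log(N / df)."""
    tf = Counter(doc)
    return {word: count * math.log(n_docs / doc_freq[word])
            for word, count in tf.items()}

scores = [tfidf(doc) for doc in docs]
print(scores[2])
```

The word that appears in every document gets a score of 0, while the word repeated twice in the third text, and present in only two documents, scores highest there.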

tok.texts_to_matrix(texts, mode = 'tfidf')

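tok.texts_to_matrix(texts, mode = 'freq')
# Expresses each word's count within a sentence as a ratio
# Denominator: total number of words in the sentence
# Numerator: count of each word
# mode = 'freq' is rarely used; the other three are common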
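To make the difference between the modes concrete, here is a small pure-Python sketch that encodes one sentence three ways. `matrix_row` and the column order in `vocab` are hypothetical, chosen only for illustration; the real texts_to_matrix output also has an extra column 0, because index 0 is never assigned to a word.

```python
from collections import Counter

def matrix_row(tokens, vocab, mode="binary"):
    """Encode one tokenized sentence as a row over vocab (illustrative helper)."""
    counts = Counter(tokens)
    total = len(tokens)
    row = []
    for word in vocab:
        c = counts[word]
        if mode == "binary":
            row.append(1.0 if c > 0 else 0.0)   # present or not
        elif mode == "count":
            row.append(float(c))                # raw count
        elif mode == "freq":
            row.append(c / total)               # count / sentence length
        else:
            raise ValueError(f"unsupported mode: {mode}")
    return row

vocab = ["자연어", "알고리즘", "처리", "nlp"]
tokens = "자연어 nlp 알고리즘 알고리즘".split()

binary_row = matrix_row(tokens, vocab, "binary")
count_row = matrix_row(tokens, vocab, "count")
freq_row = matrix_row(tokens, vocab, "freq")
print(binary_row, count_row, freq_row)
```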

LSTM

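A sequence of words has a time-series structure.

We design an LSTM-based model that generates the next word from the news article titles stored in headline.

Example behavior)
Input: I, plus the number of words to generate
Output: a sequence of words that starts with I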

1. Loading the data

from google.colab import drive
drive.mount('/content/drive')

df = pd.read_csv('/content/drive/MyDrive/NLP/NYT_2018.csv')
df

df.headline.isnull().sum()

# Pull the values out of the headline column
df.headline # selecting a single column returns a Series

df.headline.values # data structure: array

list(df.headline.values) # data structure: list

headline = []
headline.extend(list(df.headline.values))

headline

len(headline) #1324

2. Preprocessing

Removing the 'Unknown' entries

# Counting 'Unknown', method 1 (takes less time)
(df.headline == 'Unknown').sum()

# Counting 'Unknown', method 2
len([w for w in headline if w == 'Unknown'])

# Counting 'Unknown', method 3
count = 0

for i in headline:
  if i == 'Unknown':
    count += 1

print(count)

# Counting the headlines that are not 'Unknown'
len([w for w in headline if w != 'Unknown'])

headline[300]

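def pre_func(title):
  # Lowercase every character and drop ASCII punctuation
  res = ''.join(w.lower() for w in title if w not in punctuation)

  return res

pre_headline = [pre_func(x) for x in headline if x != 'Unknown']
# Read the headlines one at a time into x, skipping those equal to 'Unknown'
# (!= compares whole strings), and pass each one to pre_func.
# The returned string (lowercased, punctuation removed) is added to the list as an element.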
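As a quick check of the preprocessing step, the sketch below runs the same character filter on a single title. The raw headline is a hypothetical string written to resemble the data, not a value read from the CSV. The sketch also contrasts `not in` with `!=`: on strings, `in` is a substring test, so the two are not interchangeable when filtering out 'Unknown'.

```python
from string import punctuation

def pre_func(title):
    # Keep every character that is not ASCII punctuation, lowercased.
    return ''.join(ch.lower() for ch in title if ch not in punctuation)

# Hypothetical raw headline, for illustration only.
raw = "Former N.F.L. Cheerleaders’ Settlement Offer: $1 and a Meeting With Goodell"
cleaned = pre_func(raw)
print(cleaned)

# `in` on strings tests for a substring; != compares whole strings.
substring_check = "Un" not in "Unknown"   # False, because "Un" is a substring
exact_check = "Un" != "Unknown"           # True, because the strings differ
print(substring_check, exact_check)
```

Note that string.punctuation only contains ASCII symbols, so a typographic apostrophe (’) survives the filter.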

pre_headline

3. Building the vocabulary from the corpus

(1) Creating the tokenizer

tok = Tokenizer()

tok.fit_on_texts(pre_headline)

tok.index_word
len(tok.index_word) #3620

vocab_size = len(tok.index_word)+1

(2) Integer encoding

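sequences=[]
for s in pre_headline:
  #print(tok.texts_to_sequences([s])[0])
  # Encode one sentence at a time. texts_to_sequences expects a list of texts,
  # so the string s is wrapped in [] and [0] takes the single result.
  # Words that are not in the vocabulary are dropped.
  enc = tok.texts_to_sequences([s])[0]
  for i in range(1,len(enc)): # build every prefix of length 2 or more
    seq = enc[:i+1]
    sequences.append(seq)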

sequences[:15]
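The inner loop turns each encoded sentence into all of its prefixes of length two or more. A minimal standalone sketch of that step follows; `prefix_sequences` is a name introduced here for illustration.

```python
def prefix_sequences(encoded):
    """Return every prefix of length >= 2 of an integer-encoded sentence."""
    return [encoded[:i + 1] for i in range(1, len(encoded))]

enc = [96, 264, 1101, 1102]
prefixes = prefix_sequences(enc)
print(prefixes)

# A one-word sentence yields no training sequence at all.
single = prefix_sequences([5])
print(single)
```

A sentence of n words therefore contributes n - 1 training sequences.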

Turning tok.word_index into a mapping shaped like tok.index_word

idx2word = {}
type(tok.word_index)
for k, v in tok.word_index.items():
  idx2word[v] = k # store with key and value swapped

idx2word # more frequent words come first; index 1 is the most frequent word
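The same inversion can be written as a dict comprehension. The mapping below is a hypothetical toy example, not the real vocabulary.

```python
# Hypothetical toy word-to-index mapping.
word_index = {"the": 1, "a": 2, "trump": 3}

# Swap keys and values in one expression.
idx2word = {idx: word for word, idx in word_index.items()}
print(idx2word)
```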

4. Padding

# Find the longest sequence
print(max(len(i) for i in sequences)) # the longest one has 24 words
ml = max(len(i) for i in sequences)

# Padding: make every sequence the same length
sequences = pad_sequences(sequences, maxlen=ml, padding='pre')

sequences
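What padding = 'pre' does can be sketched in a few lines of plain Python. `pad_pre` is an illustrative stand-in, not the Keras implementation; it assumes the default behavior of also truncating from the front when a sequence is longer than maxlen.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad (or left-truncate) each sequence to maxlen."""
    padded = []
    for seq in seqs:
        seq = seq[-maxlen:]  # keep only the last maxlen items if too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

padded = pad_pre([[96, 264], [96, 264, 1101]], maxlen=4)
print(padded)
```

Padding on the left keeps the most recent words at the end of every row, which is where the next-word label is taken from in the following step.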

5. Splitting into x and y

To predict the next word with an LSTM, the data has to be split into x and y.
x: the sequence of input words
y: the next word the model should output

Take the sentence "we design the model to predict words" as an example.
The sentence is split into words, and the training pairs are built as
word 1 in, word 2 out
words 1~2 in, word 3 out
words 1~3 in, word 4 out
and so on.

ex1)
Input -> Output (the word predicted for that input)
we -> design
we design -> the
...
we design the model to predict -> words

ex2)

[96, 264],	former nfl
[96, 264, 1101],	former nfl cheerleaders
[96, 264, 1101, 1102],	former nfl cheerleaders’ settlement
[96, 264, 1101, 1102, 573],	former nfl cheerleaders’ settlement offer
[96, 264, 1101, 1102, 573, 51],	former nfl cheerleaders’ settlement offer 1
[96, 264, 1101, 1102, 573, 51, 7],	former nfl cheerleaders’ settlement offer 1 and
[96, 264, 1101, 1102, 573, 51, 7, 2],	former nfl cheerleaders’ settlement offer 1 and a
[96, 264, 1101, 1102, 573, 51, 7, 2, 366],	former nfl cheerleaders’ settlement offer 1 and a meeting
[96, 264, 1101, 1102, 573, 51, 7, 2, 366, 11],	former nfl cheerleaders’ settlement offer 1 and a meeting with
[96, 264, 1101, 1102, 573, 51, 7, 2, 366, 11, 1103],	former nfl cheerleaders’ settlement offer 1 and a meeting with goodell

It may look as if the model simply looks up the output word,
but it can also predict the next word for inputs that are only similar to what it saw during training.

Because words occur one after another over time, the data can be treated as a time series.
ex)
First word to appear: we
Next: design
...

# Every list holds at least 2 words (each needs at least one x value and one y value)
# Task: for every sequence in sequences, store elements 0 to n-1 as x and element n as y.

x = [i[:-1] for i in sequences ]
y = [i[-1] for i in sequences ]

# The LSTM expects its input as an array
# list -> array
sequences = np.array(sequences)
x = sequences[:,:-1]
y = sequences[:,-1]

x.shape #(7809, 23)
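On a tiny, made-up set of padded sequences the split looks like this. The values are illustrative only; with the real data the same slicing produces the (7809, 23) input shown above.

```python
# Hypothetical padded sequences of length 4.
padded = [
    [0, 0, 96, 264],
    [0, 96, 264, 1101],
    [96, 264, 1101, 1102],
]

x = [row[:-1] for row in padded]  # everything except the last word
y = [row[-1] for row in padded]   # the last word is the label
print(x)
print(y)
```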

6. One-hot encoding y

y=to_categorical(y, vocab_size)

y.shape #(7809, 3621)

vocab_size # each label is a vector of 3621 dimensions
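One-hot encoding maps each integer label to a vector that is 1 at that index and 0 elsewhere. A minimal pure-Python sketch of the idea, with a hypothetical helper `one_hot` and a toy class count:

```python
def one_hot(labels, num_classes):
    """Turn integer labels into one-hot rows (illustrative helper)."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0, 3], num_classes=5)
print(encoded)
```

With 3621 classes these rows become large. One commonly used alternative is to keep y as integers and compile with loss = 'sparse_categorical_crossentropy', which avoids building the one-hot matrix.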

7. Building the model

model = Sequential()
model.add(Embedding(vocab_size, 10)) # map each of the 3621 word indices to a 10-dimensional vector
model.add(LSTM(128))
model.add(Dense(vocab_size, activation='softmax')) # fully connected layer; output size is vocab_size

model.summary()
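The parameter counts of the three layers can be worked out by hand from the layer sizes. The sketch below assumes the standard LSTM layout of four gates, each with input weights, recurrent weights, and a bias; the result should agree with what model.summary() reports once the model has been built.

```python
vocab_size = 3621
embed_dim = 10
units = 128

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim

# LSTM: 4 gates, each with (input + recurrent) weights and a bias.
lstm_params = 4 * (units * (embed_dim + units) + units)

# Dense: weight matrix plus bias for every output word.
dense_params = units * vocab_size + vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

Most of the parameters sit in the final Dense layer, because it has to produce a score for every word in the vocabulary.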

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

model.fit(x,y,epochs = 200, verbose=2)
# Training samples: the 7809 prefix sequences / one pass over all of them is epochs=1

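A program that predicts the 10 words following an input word

How the model generates text
Input -> 'I'

Requirement: predict the 10 words that follow 'I'
(model input) -> (model output, which is fed back in as the next model input)
I -> model -> I am -> model -> I am studying -> model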
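The feed-the-output-back-in loop can be sketched without a trained network. Here a hypothetical lookup table stands in for the model, so `next_word_table`, `predict_next`, and `generate` are all illustrative names; the real model conditions on the whole padded sequence, not only on the last word.

```python
# Hypothetical stand-in for the trained model:
# maps the last word to its most likely next word.
next_word_table = {
    "what": "to",
    "to": "do",
    "do": "when",
    "when": "the",
}

def predict_next(text):
    """Return the predicted next word, or None if there is no prediction."""
    last = text.split()[-1]
    return next_word_table.get(last)

def generate(seed, n):
    """Greedy generation: append the predicted word and feed the result back in."""
    text = seed
    for _ in range(n):
        word = predict_next(text)
        if word is None:  # stop early when nothing can be predicted
            break
        text = text + " " + word
    return text

result = generate("what", 3)
print(result)
```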

# Function that generates n words following the seed text c_word
def gen_sent(model, tok, c_word, n):
  for _ in range(n):
    enc = tok.texts_to_sequences([c_word])[0]
    enc = pad_sequences([enc], maxlen=ml-1, padding='pre') # same length as the training input x
    #print(enc)

    res = model.predict(enc, verbose=0)
    res = int(np.argmax(res)) # index with the highest probability

    w = tok.index_word.get(res) # look up the word for that index
    if w is None: # index 0 is padding and has no word
      break
    #print("predicted word: ", w)

    c_word = c_word + " " + w # append the predicted word
    #print(c_word)

  return c_word

pred_sent = gen_sent(model, tok, 'what', 10) # model, tokenizer object, seed word, number of words to predict

pred_sent

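Tokenizer-related functions

tokenizer = Tokenizer(num_words = 100)
num_words restricts encoding to the most frequent words. With num_words = 100, only the 99 most frequent words (indices 1 to 99) are used and the rest are ignored.

tokenizer.fit_on_texts(sentences)
The fit_on_texts() method takes the text data and builds the vocabulary, assigning indices in order of word frequency.

word_dic = tokenizer.word_index
The tokenizer's word_index attribute is a dictionary of word and integer key-value pairs. Words are lowercased automatically when the tokenizer is fitted, and punctuation such as exclamation marks and periods is removed. word_index holds every word even when num_words is set; the limit is applied when texts are encoded.

sequences = tokenizer.texts_to_sequences(sentences)
The texts_to_sequences() method converts the words of each text into a sequence of integers.

padded = pad_sequences(sequences)
Passing these sequences to pad_sequences() pads them with 0 so that they all have the same length. Since the longest sequence in this example has length 10, the result is a NumPy array in which every row has length 10.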
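The behavior described above can be imitated in plain Python to see how the pieces fit together. This is a simplified sketch, not the Keras implementation: `fit_word_index` and `to_sequences` are hypothetical helpers, the sample sentences are made up, and punctuation filtering is left out.

```python
from collections import Counter

def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts.items(), key=lambda item: -item[1])
    return {word: rank for rank, (word, _) in enumerate(ranked, start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Encode texts as index lists, keeping only indices below num_words."""
    result = []
    for text in texts:
        seq = []
        for word in text.lower().split():
            idx = word_index.get(word)
            if idx is None or (num_words is not None and idx >= num_words):
                continue  # unknown or too rare: skip the word
            seq.append(idx)
        result.append(seq)
    return result

sentences = ["I love my dog", "I love my cat", "You love my dog"]
word_index = fit_word_index(sentences)
full = to_sequences(sentences, word_index)
limited = to_sequences(sentences, word_index, num_words=3)
print(word_index)
print(full)
print(limited)
```

With num_words = 3 only indices 1 and 2 survive, which mirrors the "num_words - 1 most frequent words" rule.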
