Korean NLP Datasets and Language Models (BERT)

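Before We Begin (always being updated.. 📡)

Compared with English, Korean NLP seems to have relatively few resources. Since I first started studying NLP, though, many talented people have released BERT models trained on Korean, and the NLP community keeps growing. Watching this makes me feel like we are all in the same boat, and I am glad and very grateful for it.

Everything here is simply collected from the linked web pages and organized in one place. I hope it can serve as an archive for Korean natural language processing.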

Ⅰ. Korean BERT Models

KorBERT (ETRI + Saltlux) http://aiopen.etri.re.kr/service_dataset.php
KoBERT (SKT) https://github.com/SKTBrain/KoBERT
KcBERT https://github.com/Beomi/KcBERT
HanBERT (TwoBlockAI) https://github.com/monologg/HanBert-Transformers
KoELECTRA https://github.com/monologg/KoELECTRA/

Ⅱ. Korean Datasets

Korean hate speech data https://github.com/kocohub/korean-hate-speech
Modu Corpus (모두의 말뭉치) https://corpus.korean.go.kr/
Collection of Korean datasets https://github.com/songys/AwesomeKorean_Data

Ⅲ. Korean Text Processing

Spacing correction https://github.com/haven-jeon/PyKoSpacing
Spell checking https://github.com/ssut/py-hanspell
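
As a rough illustration of how the two tools above might be combined, here is a minimal sketch that runs spacing correction and then spell checking. The helper names `run_pipeline` and `build_korean_steps` are my own and hypothetical, not part of either library. The sketch assumes both packages are installed as described in their READMEs, and that the online service py-hanspell relies on is reachable.

```python
from typing import Callable, Iterable, List

# A cleanup step takes a string and returns the cleaned string.
TextStep = Callable[[str], str]


def run_pipeline(text: str, steps: Iterable[TextStep]) -> str:
    """Apply each cleanup step to the text, in order."""
    for step in steps:
        text = step(text)
    return text


def build_korean_steps() -> List[TextStep]:
    """Build spacing and spell-check steps (hypothetical helper).

    Assumes PyKoSpacing and py-hanspell are installed. The imports are
    deferred so that run_pipeline itself works without either package.
    """
    from pykospacing import Spacing      # PyKoSpacing
    from hanspell import spell_checker   # py-hanspell (calls an online service)

    spacing = Spacing()

    def fix_spacing(text: str) -> str:
        # Spacing objects are callable and return the re-spaced string.
        return spacing(text)

    def fix_spelling(text: str) -> str:
        # check() returns a result object; .checked holds the corrected text.
        return spell_checker.check(text).checked

    return [fix_spacing, fix_spelling]
```

With both packages available, calling `run_pipeline("아버지가방에들어가신다", build_korean_steps())` would fix the spacing first and then check the spelling. Running spacing first is only one possible order, so it is worth trying both on your own data.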

Ⅳ. Great Developers Working on Korean Text Processing

Hyunjoong Kim (lovit) https://github.com/lovit
송영숙 (songys) https://github.com/songys
Jangwon Park (monologg) https://github.com/monologg
