Text Classification Cn

Chinese text classification practice based on the Sogou news corpus, using traditional machine-learning methods as well as pre-trained models.

Text Classification

This project classifies Chinese news text into nine categories, first with traditional machine-learning models (Part 1) and then with neural models in Keras (Part 2).

Corpus: Sogou news corpus, SogouC.reduced (download password: P9M4).

Pre-trained word vectors: Chinese Word Vectors (GitHub), Sogou News 300d.

Part 1: scikit-learn

This part builds traditional machine-learning classifiers with scikit-learn, in five steps: prepare the corpus, load the datasets, extract TF-IDF features, train classifiers, and evaluate the models.

1. Prepare the corpus

The corpus contains nine categories, each in a folder named by a category code such as `C000008`. The downloaded archive (`.20061127.zip`) is extracted into a `CN_Corpus` directory:

CN_Corpus
SogouC.reduced
    Reduced
        C000008
        C000010
        C000013
        C000014
        C000016
        C000020
        C000022
        C000023
        C000024
The category codes map to readable labels as follows:

category_labels = {
    'C000008': '_08_Finance',
    'C000010': '_10_IT',
    'C000013': '_13_Health',
    'C000014': '_14_Sports',
    'C000016': '_16_Travel',
    'C000020': '_20_Education',
    'C000022': '_22_Recruit',
    'C000023': '_23_Culture',
    'C000024': '_24_Military'
}

Each category is then split into a training set and a test set (roughly 80% / 20%) under the `data` directory, organized as follows (a sketch of the split script follows the listing):

data
test
  _08_Finance
  _10_IT
  _13_Health
  _14_Sports
  _16_Travel
  _20_Education
  _22_Recruit
  _23_Culture
  _24_Military
train
    _08_Finance
    _10_IT
    _13_Health
    _14_Sports
    _16_Travel
    _20_Education
    _22_Recruit
    _23_Culture
    _24_Military
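The split script itself is not included in this excerpt; a minimal sketch of how it might be done with the standard library, assuming the extracted `CN_Corpus/SogouC.reduced/Reduced` layout, the `category_labels` mapping above, and a simple percentage split (the exact rule used by the author may differ):

import os, random, shutil

SRC_DIR = 'CN_Corpus/SogouC.reduced/Reduced'  # extracted corpus (assumed path)
DST_DIR = 'data'
TRAIN_RATIO = 0.8                             # assumed 80% / 20% split

for code, label in category_labels.items():
    files = os.listdir(os.path.join(SRC_DIR, code))
    random.shuffle(files)
    split = int(len(files) * TRAIN_RATIO)
    for subset, names in (('train', files[:split]), ('test', files[split:])):
        dst = os.path.join(DST_DIR, subset, label)
        os.makedirs(dst, exist_ok=True)
        for name in names:
            shutil.copy(os.path.join(SRC_DIR, code, name), dst)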

2. Load the datasets

Loading the corpus yields the training texts (X_train_data), training labels (y_train), test texts (X_test_data), and test labels (y_test):

X_train_data, y_train, X_test_data, y_test = load_datasets()
label: _08_Finance, len: 1500
label: _10_IT, len: 1500
label: _13_Health, len: 1500
label: _14_Sports, len: 1500
label: _16_Travel, len: 1500
label: _20_Education, len: 1500
label: _22_Recruit, len: 1500
label: _23_Culture, len: 1500
label: _24_Military, len: 1500
train corpus len: 13500

label: _08_Finance, len: 490
label: _10_IT, len: 490
label: _13_Health, len: 490
label: _14_Sports, len: 490
label: _16_Travel, len: 490
label: _20_Education, len: 490
label: _22_Recruit, len: 490
label: _23_Culture, len: 490
label: _24_Military, len: 490
test corpus len: 4410
X_train_data[1000]
'<a segmented Chinese news document; words separated by spaces>'
y_train[1000]
'_08_Finance'
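The `load_datasets` helper is not shown in this excerpt; a minimal sketch of what it might look like, assuming the `data/train` and `data/test` layout above, jieba for word segmentation, and GBK-family encoded source files (all assumptions). Documents are stored as space-joined tokens, which is what `TfidfVectorizer` expects later:

import os
import jieba

def load_corpus(path):
    texts, labels = [], []
    for label in sorted(os.listdir(path)):            # e.g. '_08_Finance'
        folder = os.path.join(path, label)
        files = os.listdir(folder)
        for name in files:
            with open(os.path.join(folder, name), encoding='gb18030', errors='ignore') as f:
                texts.append(' '.join(jieba.cut(f.read())))   # segment and join with spaces
            labels.append(label)
        print('label: %s, len: %d' % (label, len(files)))
    return texts, labels

def load_datasets():
    X_train_data, y_train = load_corpus('data/train')
    print('train corpus len: %d\n' % len(X_train_data))
    X_test_data, y_test = load_corpus('data/test')
    print('test corpus len: %d' % len(X_test_data))
    return X_train_data, y_train, X_test_data, y_test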

(Figure: word cloud example of the corpus)

3. Feature extraction: TF-IDF

Feature extraction turns the segmented text into numeric vectors. Here TF-IDF (term frequency-inverse document frequency) is used: a term's weight grows with how often it appears in a document and shrinks with how many documents it appears in, so words that are common across the whole corpus carry little weight.

scikit-learn's `TfidfVectorizer` does the tokenizing and the TF-IDF weighting in one step; a stop-word list is passed in to drop common function words:

stopwords = open('dict/stop_words.txt', encoding='utf-8').read().split()

from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF feature extraction
tfidf_vectorizer = TfidfVectorizer(stop_words=stopwords)
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train_data)
words = tfidf_vectorizer.get_feature_names()
X_train_tfidf.shape
(13500, 223094)
len(words)
223094
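To see what the TF-IDF weighting does, one can, for instance, list the highest-weighted terms of a single training document; a small sketch using the `tfidf_vectorizer`, `X_train_tfidf`, and `words` defined above:

import numpy as np

doc_id = 1000                                   # any training document
row = X_train_tfidf[doc_id].toarray().ravel()   # its TF-IDF vector as a dense array
top = np.argsort(row)[::-1][:10]                # indices of the 10 largest weights
for i in top:
    print(words[i], round(row[i], 4))           # term and its TF-IDF weight in this document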

4. Train classifiers

Benchmark: Multinomial Naive Bayes

scikit-learn's `MultinomialNB` implements multinomial naive Bayes, a good fit for discrete features such as word counts or TF-IDF weights, and serves as the benchmark model:

from sklearn.naive_bayes import MultinomialNB

classifier = MultinomialNB()
classifier.fit(X_train_tfidf, y_train)
MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

To classify new documents, reuse the fitted `tfidf_vectorizer` and call `transform` (not `fit_transform`) to obtain their TF-IDF features, then call the classifier's `predict`:

news_lastest = ["<recent news article 1>",   # raw Chinese text, omitted in this excerpt
                "<recent news article 2>",
                "<recent news article 3>"]
X_new_data = [preprocess(doc) for doc in news_lastest]
X_new_data
['<segmented article 1>',
 '<segmented article 2>',
 '<segmented article 3>']
X_new_tfidf = tfidf_vectorizer.transform(X_new_data)
predicted  = classifier.predict(X_new_tfidf)
predicted
array(['_08_Finance', '_20_Education', '_24_Military'], dtype='<U13')
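The `preprocess` helper used above is not shown in this excerpt; a minimal sketch, assuming jieba segmentation and the same stop-word list loaded earlier (the classifier expects space-joined tokens, matching the training data):

import jieba

def preprocess(doc):
    # segment the raw Chinese text and drop stop words and whitespace tokens
    tokens = [w for w in jieba.cut(doc) if w.strip() and w not in stopwords]
    return ' '.join(tokens)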

5. Evaluate the models

The naive Bayes model reaches 84.35% accuracy on the test set, which is the benchmark for the other models. `classification_report` gives per-class precision, recall, and F1:
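For the report below, the test set is presumably vectorized and predicted with the fitted objects above, along these lines (the metrics imports are shown here for completeness):

from sklearn.metrics import classification_report, confusion_matrix

X_test_tfidf = tfidf_vectorizer.transform(X_test_data)   # transform, not fit_transform
predicted = classifier.predict(X_test_tfidf)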

print(classification_report(y_test, predicted))
               precision    recall  f1-score   support

  _08_Finance       0.88      0.88      0.88       488
       _10_IT       0.72      0.87      0.79       403
   _13_Health       0.82      0.84      0.83       478
   _14_Sports       0.95      1.00      0.97       466
   _16_Travel       0.86      0.92      0.89       455
_20_Education       0.71      0.87      0.79       401
  _22_Recruit       0.91      0.65      0.76       690
  _23_Culture       0.80      0.77      0.79       513
 _24_Military       0.94      0.89      0.92       516

     accuracy                           0.84      4410
    macro avg       0.84      0.86      0.84      4410
 weighted avg       0.85      0.84      0.84      4410
confusion_matrix(y_test, predicted)
array([[429,  12,  15,  10,   5,   3,   6,   4,   4],
       [ 20, 352,   6,   3,   5,   2,  10,   4,   1],
       [  3,  38, 403,   0,   6,  16,   5,   7,   0],
       [  0,   1,   0, 464,   0,   0,   0,   1,   0],
       [  5,  11,   0,   0, 419,   5,   2,  12,   1],
       [  4,  11,   5,   1,   2, 350,  13,  14,   1],
       [ 22,  21,  57,   8,  14,  87, 448,  32,   1],
       [  3,  25,   4,   4,  34,  23,   5, 394,  21],
       [  4,  19,   0,   0,   5,   4,   1,  22, 461]])

Logistic Regression

Next, a Logistic Regression model; a `Pipeline` chains the TF-IDF vectorizer and the classifier:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

text_clf_lr = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
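The pipeline works directly on the segmented text, so training and evaluation are presumably one call each, along these lines; the SVM pipeline further below is trained and evaluated the same way (`predicted_lr` is a hypothetical name):

text_clf_lr.fit(X_train_data, y_train)
predicted_lr = text_clf_lr.predict(X_test_data)
print(classification_report(y_test, predicted_lr))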
               precision    recall  f1-score   support

  _08_Finance       0.87      0.91      0.89       465
       _10_IT       0.77      0.86      0.81       440
   _13_Health       0.91      0.82      0.86       546
   _14_Sports       0.98      0.99      0.98       483
   _16_Travel       0.90      0.90      0.90       488
_20_Education       0.79      0.91      0.85       429
  _22_Recruit       0.86      0.85      0.85       495
  _23_Culture       0.86      0.75      0.80       556
 _24_Military       0.95      0.92      0.93       508

     accuracy                           0.88      4410
    macro avg       0.88      0.88      0.88      4410
 weighted avg       0.88      0.88      0.88      4410

Logistic Regression reaches 88% accuracy, a clear improvement over the naive Bayes benchmark. Its confusion matrix:

array([[425,  11,   7,   1,   2,   4,   9,   5,   1],
       [ 23, 377,   4,   3,   9,   3,  12,   7,   2],
       [  9,  41, 447,   0,   6,  23,  10,  10,   0],
       [  0,   2,   0, 478,   0,   1,   0,   1,   1],
       [  8,  12,   0,   0, 440,   8,   3,  16,   1],
       [  1,   6,   2,   0,   1, 389,  20,  10,   0],
       [  8,   7,  22,   1,   5,  26, 420,   6,   0],
       [ 11,  23,   8,   6,  24,  30,  15, 419,  20],
       [  5,  11,   0,   1,   3,   6,   1,  16, 465]])

SVM

SVMs are often the strongest of the traditional text classifiers. Here an `SGDClassifier` with hinge loss (a linear SVM) is used; it reaches 89% accuracy, slightly better than Logistic Regression:

from sklearn.linear_model import SGDClassifier

text_clf_svm = Pipeline([
    ('vect', TfidfVectorizer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2')),
])
               precision    recall  f1-score   support

  _08_Finance       0.87      0.92      0.90       463
       _10_IT       0.78      0.85      0.81       446
   _13_Health       0.93      0.82      0.87       558
   _14_Sports       0.99      0.99      0.99       488
   _16_Travel       0.91      0.91      0.91       489
_20_Education       0.82      0.92      0.86       437
  _22_Recruit       0.89      0.85      0.87       513
  _23_Culture       0.85      0.83      0.84       503
 _24_Military       0.96      0.91      0.93       513

     accuracy                           0.89      4410
    macro avg       0.89      0.89      0.89      4410
 weighted avg       0.89      0.89      0.89      4410

The corresponding confusion matrix:

array([[428,  11,   5,   1,   5,   3,   5,   8,   0],
       [ 23, 382,   4,   0,   6,   8,  11,   7,   2],
       [  9,  39, 453,   0,   7,  19,  10,   9,   0],
       [  0,   1,   0, 485,   0,   2,   0,   1,   0],
       [  7,  14,   0,   0, 449,   7,   3,  14,   1],
       [  3,   7,   1,   1,   1, 403,  14,  15,   0],
       [ 12,   8,  22,   0,   4,  25, 437,   8,   0],
       [  4,  15,   5,   3,  15,  17,   9, 411,  19],
       [  4,  13,   0,   0,   3,   6,   1,  17, 468]])

Part 2: CNN with Keras

This part builds neural text classifiers with Keras, in five steps: prepare the data, load pre-trained word vectors, preprocess the text with Keras, build the embedding matrix, and define and train the models.

1. Prepare the data

texts, labels = load_raw_datasets()
0 C000008 Finance [1, 0, 0, 0, 0, 0, 0, 0, 0]
1 C000010 IT [0, 1, 0, 0, 0, 0, 0, 0, 0]
2 C000013 Health [0, 0, 1, 0, 0, 0, 0, 0, 0]
3 C000014 Sports [0, 0, 0, 1, 0, 0, 0, 0, 0]
4 C000016 Travel [0, 0, 0, 0, 1, 0, 0, 0, 0]
5 C000020 Education [0, 0, 0, 0, 0, 1, 0, 0, 0]
6 C000022 Recruit [0, 0, 0, 0, 0, 0, 1, 0, 0]
7 C000023 Culture [0, 0, 0, 0, 0, 0, 0, 1, 0]
8 C000024 Military [0, 0, 0, 0, 0, 0, 0, 0, 1]

2. Load the pre-trained word vectors

The pre-trained vectors are the 300-dimensional Sogou News word vectors from Chinese Word Vectors mentioned above. The file is plain text: the first line gives the vocabulary size and dimension, and each following line holds a word and its 300 components:

364180 300
<word> 0.003146 0.582671 0.049029 -0.312803 0.522986 0.026432 -0.097115 0.194231 -0.362708 ...

embeddings_index = load_pre_trained()
Found 364180 word vectors, dimension 300
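The `load_pre_trained` helper is not shown; a minimal sketch of reading the text-format vector file, assuming the layout described above (header line, then one word plus 300 floats per line). The file name `sgns.sogou.word` is an assumption:

import numpy as np

def load_pre_trained(path='sgns.sogou.word'):   # path is an assumption
    embeddings_index = {}
    with open(path, encoding='utf-8') as f:
        vocab_size, dim = map(int, f.readline().split())   # header, e.g. "364180 300"
        for line in f:
            values = line.rstrip().split(' ')
            embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
    print('Found %d word vectors, dimension %d' % (len(embeddings_index), dim))
    return embeddings_index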

3. Preprocess the text with Keras

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
import numpy as np

MAX_SEQUENCE_LEN = 1000  # maximum length of each text sequence (longer texts are truncated)
MAX_WORDS_NUM = 20000  # keep only the 20,000 most frequent words
VAL_SPLIT_RATIO = 0.2 # fraction of the data held out for validation

tokenizer = Tokenizer(num_words=MAX_WORDS_NUM)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print(len(word_index)) # total number of distinct tokens found
# print(word_index.get('<word>')) # look up the index of a given word
dict_swaped = lambda _dict: {val:key for (key, val) in _dict.items()}
word_dict = dict_swaped(word_index) # swap key-value
data = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LEN)

labels_categorical = to_categorical(np.asarray(labels))
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels_categorical.shape)

indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels_categorical = labels_categorical[indices]

# split data by ratio
val_samples_num = int(VAL_SPLIT_RATIO * data.shape[0])

x_train = data[:-val_samples_num]
y_train = labels_categorical[:-val_samples_num]
x_val = data[-val_samples_num:]
y_val = labels_categorical[-val_samples_num:]

`word_index` contains every token found in the corpus, but because the tokenizer was built with `num_words=20000`, `texts_to_sequences` keeps only words whose index is below 20000, so no index in `data` reaches 20000:

len(data[data>=20000])
0
# convert a sequence of indices back to the original (segmented) document
for w_index in data[0]:
    if w_index != 0:
        print(word_dict[w_index], end=' ')
category_labels[dict_swaped(labels_index)[np.argmax(labels_categorical[0])]]  # label of the first (shuffled) sample
'_20_Education'

4. Build the embedding matrix

The embedding matrix has shape `(MAX_WORDS_NUM + 1, EMBEDDING_DIM)`: row `i` holds the pre-trained vector of the word whose `word_index` value is `i`, and row 0 is reserved for the padding value 0 used by `pad_sequences`. Words without a pre-trained vector stay all-zero; about 92.35% of the 20,000 kept words are covered:

EMBEDDING_DIM = 300 # embedding dimension
embedding_matrix = np.zeros((MAX_WORDS_NUM+1, EMBEDDING_DIM)) # row 0 stays zero for the padding index
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if i < MAX_WORDS_NUM:
        # words not found in the pre-trained vectors remain all-zero
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector
nonzero_elements = np.count_nonzero(np.count_nonzero(embedding_matrix, axis=1))
nonzero_elements / MAX_WORDS_NUM
0.9235

Embedding Layer

The Keras `Tokenizer` API turned each document into a sequence of integer indices; an `Embedding` layer maps those indices to dense vectors. Its three key arguments are:

  • input_dim: the vocabulary size, here MAX_WORDS_NUM + 1
  • output_dim: the dimension of the word vectors, here EMBEDDING_DIM
  • input_length: the length of the padded input sequences, here MAX_SEQUENCE_LEN

Passing `weights=[embedding_matrix]` loads the pre-trained vectors into the `Embedding` layer, and `trainable=False` freezes them so they are not updated during training.

5. Define and train the models

Keras offers two ways to define a model:

  • the Sequential model, a linear stack of layers
  • the Functional API, for more complex architectures (multiple inputs and outputs, shared layers)

Model 1 trains its own embedding from scratch: an `Embedding` layer followed by `Flatten` and fully connected (`Dense`) layers:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

input_dim = x_train.shape[1]

model1 = Sequential()
model1.add(Embedding(input_dim=MAX_WORDS_NUM+1, 
                    output_dim=EMBEDDING_DIM, 
                    input_length=MAX_SEQUENCE_LEN))
model1.add(Flatten())
model1.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model1.add(Dense(64, activation='relu'))
model1.add(Dense(len(labels_index), activation='softmax'))

model1.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history1 = model1.fit(x_train, 
                    y_train,
                    epochs=30,
                    batch_size=128,
                    validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/30
14328/14328 [==============================] - 59s 4ms/step - loss: 3.1273 - acc: 0.2057 - val_loss: 1.9355 - val_acc: 0.2510
Epoch 2/30
14328/14328 [==============================] - 56s 4ms/step - loss: 2.0853 - acc: 0.3349 - val_loss: 1.8037 - val_acc: 0.3473
Epoch 3/30
14328/14328 [==============================] - 56s 4ms/step - loss: 1.7210 - acc: 0.4135 - val_loss: 1.2498 - val_acc: 0.5731
......
Epoch 29/30
14328/14328 [==============================] - 56s 4ms/step - loss: 0.5843 - acc: 0.8566 - val_loss: 1.3564 - val_acc: 0.6516
Epoch 30/30
14328/14328 [==============================] - 56s 4ms/step - loss: 0.5864 - acc: 0.8575 - val_loss: 0.5970 - val_acc: 0.8501

Keras layers expose their weights through two methods:

  • layer.get_weights(): returns the layer's weights as a list of numpy arrays
  • layer.set_weights(weights): sets the layer's weights from a list of numpy arrays with the same shapes as those returned by layer.get_weights()
embedding_custom = model1.layers[0].get_weights()[0]
embedding_custom
array([[ 0.39893672, -0.9062594 ,  0.35500282, ..., -0.73564297,
         0.50492775, -0.39815223],
       [ 0.10640696,  0.18888871,  0.05909824, ..., -0.1642032 ,
        -0.02778293, -0.15340094],
       [ 0.06566656, -0.04023357,  0.1276007 , ...,  0.04459211,
         0.08887506,  0.05389333],
       ...,
       [-0.12710813, -0.08472785, -0.2296919 , ...,  0.0468552 ,
         0.12868881,  0.18596107],
       [-0.03790742,  0.09758633,  0.02123675, ..., -0.08180046,
         0.10254312,  0.01284804],
       [-0.0100647 ,  0.01180602,  0.00446023, ...,  0.04730382,
        -0.03696882,  0.00119566]], dtype=float32)

Besides `get_weights`, `get_config()` returns the layer's configuration:

model1.layers[0].get_config()
{'activity_regularizer': None,
 'batch_input_shape': (None, 1000),
 'dtype': 'float32',
 'embeddings_constraint': None,
 'embeddings_initializer': {'class_name': 'RandomUniform',
  'config': {'maxval': 0.05, 'minval': -0.05, 'seed': None}},
 'embeddings_regularizer': None,
 'input_dim': 20001,
 'input_length': 1000,
 'mask_zero': False,
 'name': 'embedding_13',
 'output_dim': 300,
 'trainable': True}
plot_history(history1)

(Figure: accuracy and loss curves for model 1)
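The `plot_history` helper is not shown in this excerpt; a minimal sketch with matplotlib, assuming the Keras `History` object returned by `fit` and the metric key names ('acc', 'val_acc') that appear in the training logs above:

import matplotlib.pyplot as plt

def plot_history(history):
    acc, val_acc = history.history['acc'], history.history['val_acc']
    loss, val_loss = history.history['loss'], history.history['val_loss']
    epochs = range(1, len(acc) + 1)
    plt.figure(figsize=(12, 4))
    plt.subplot(1, 2, 1)
    plt.plot(epochs, acc, 'b-', label='training acc')
    plt.plot(epochs, val_acc, 'r-', label='validation acc')
    plt.title('Accuracy'); plt.legend()
    plt.subplot(1, 2, 2)
    plt.plot(epochs, loss, 'b-', label='training loss')
    plt.plot(epochs, val_loss, 'r-', label='validation loss')
    plt.title('Loss'); plt.legend()
    plt.show()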

Model 1 trains its embedding from scratch over 30 epochs; even after 30 epochs the validation accuracy still swings widely and the model overfits. Model 2 loads the pre-trained embedding matrix and freezes it, so 10 epochs are enough:

from keras.models import Sequential
from keras.layers import Dense, Flatten, Embedding

input_dim = x_train.shape[1]

model2 = Sequential()
model2.add(Embedding(input_dim=MAX_WORDS_NUM+1, 
                    output_dim=EMBEDDING_DIM, 
                    weights=[embedding_matrix],
                    input_length=MAX_SEQUENCE_LEN,
                    trainable=False))
model2.add(Flatten())
model2.add(Dense(64, activation='relu', input_shape=(input_dim,)))
model2.add(Dense(64, activation='relu'))
model2.add(Dense(len(labels_index), activation='softmax'))

model2.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

history2 = model2.fit(x_train, 
                    y_train,
                    epochs=10,
                    batch_size=128,
                    validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/10
14328/14328 [==============================] - 37s 3ms/step - loss: 1.3124 - acc: 0.6989 - val_loss: 0.7446 - val_acc: 0.8088
Epoch 2/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.2831 - acc: 0.9243 - val_loss: 0.5712 - val_acc: 0.8551
Epoch 3/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.1183 - acc: 0.9704 - val_loss: 0.6261 - val_acc: 0.8624
Epoch 4/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0664 - acc: 0.9801 - val_loss: 0.6897 - val_acc: 0.8607
Epoch 5/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0549 - acc: 0.9824 - val_loss: 0.7199 - val_acc: 0.8660
Epoch 6/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0508 - acc: 0.9849 - val_loss: 0.7261 - val_acc: 0.8582
Epoch 7/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0513 - acc: 0.9865 - val_loss: 0.8251 - val_acc: 0.8585
Epoch 8/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0452 - acc: 0.9858 - val_loss: 0.7891 - val_acc: 0.8707
Epoch 9/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0469 - acc: 0.9865 - val_loss: 0.8663 - val_acc: 0.8680
Epoch 10/10
14328/14328 [==============================] - 35s 2ms/step - loss: 0.0418 - acc: 0.9867 - val_loss: 0.9048 - val_acc: 0.8640
plot_history(history2)

(Figure: accuracy and loss curves for model 2)

Model 3 is a convolutional network (CNN) built with the Keras Functional API:

from keras.layers import Dense, Input, Embedding
from keras.layers import Conv1D, MaxPooling1D, Flatten
from keras.models import Model

embedding_layer = Embedding(input_dim=MAX_WORDS_NUM+1,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LEN,
                            trainable=False)


sequence_input = Input(shape=(MAX_SEQUENCE_LEN,), dtype='int32')
embedded_sequences = embedding_layer(sequence_input)
x = Conv1D(128, 5, activation='relu')(embedded_sequences)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(5)(x)
x = Conv1D(128, 5, activation='relu')(x)
x = MaxPooling1D(35)(x)  # global max pooling
x = Flatten()(x)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)

model3 = Model(sequence_input, preds)
model3.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['acc'])

history3 = model3.fit(x_train, 
                    y_train,
                    epochs=6,
                    batch_size=128,
                    validation_data=(x_val, y_val))
Train on 14328 samples, validate on 3582 samples
Epoch 1/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.9943 - acc: 0.6719 - val_loss: 0.5129 - val_acc: 0.8582
Epoch 2/6
14328/14328 [==============================] - 76s 5ms/step - loss: 0.4841 - acc: 0.8571 - val_loss: 0.3929 - val_acc: 0.8841
Epoch 3/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.3483 - acc: 0.8917 - val_loss: 0.4022 - val_acc: 0.8724
Epoch 4/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.2763 - acc: 0.9100 - val_loss: 0.3441 - val_acc: 0.8942
Epoch 5/6
14328/14328 [==============================] - 76s 5ms/step - loss: 0.2194 - acc: 0.9259 - val_loss: 0.3014 - val_acc: 0.9107
Epoch 6/6
14328/14328 [==============================] - 77s 5ms/step - loss: 0.1749 - acc: 0.9387 - val_loss: 0.3895 - val_acc: 0.8788
plot_history(history3)

(Figure: accuracy and loss curves for model 3, the CNN)

The CNN reaches the best validation accuracy of the three models (about 91% at epoch 5). Note that the final `MaxPooling1D(35)` pools over the whole remaining sequence, so it effectively performs global max pooling; an equivalent variant is sketched below.
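Since the last pooling spans the entire remaining sequence, the tail of model 3 could equivalently be written with `GlobalMaxPooling1D`, which also removes the need for `Flatten`. A small sketch of that variant, replacing the layers after the second `MaxPooling1D(5)` in the definition above:

from keras.layers import GlobalMaxPooling1D

x = Conv1D(128, 5, activation='relu')(x)   # same third convolution as above
x = GlobalMaxPooling1D()(x)                # max over all remaining time steps -> (batch, 128)
x = Dense(128, activation='relu')(x)
preds = Dense(len(labels_index), activation='softmax')(x)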
