Extracting features from a text


Import required packages

[2]:
from oceanai.modules.lab.build import Run

Build

[3]:
_b5 = Run(
    lang = 'en', # Inference language
    color_simple = '#FFF', # Plain text color (hexadecimal code)
    color_info = '#1776D2', # Informational text color (hexadecimal code)
    color_err = '#FF0000', # Error text color (hexadecimal code)
    color_true = '#008001', # Positive-information text color (hexadecimal code)
    bold_text = True, # Bold text
    num_to_df_display = 30, # Number of rows to display in tables
    text_runtime = 'Runtime', # Runtime text
    metadata = True # Display information about the library
)

[2023-12-03 00:29:57] OCEANAI - personality traits:
    Authors:
        Elena Ryumina [ryumina_ev@mail.ru]
        Dmitry Ryumin [dl_03.03.1991@mail.ru]
        Alexey Karpov [karpov@iias.spb.su]
    Maintainers:
        Elena Ryumina [ryumina_ev@mail.ru]
        Dmitry Ryumin [dl_03.03.1991@mail.ru]
    Version: 1.0.0a5
    License: BSD License

Loading a dictionary with hand-crafted features

[4]:
# Core setup
_b5.path_to_save_ = './models' # Directory to save the models
_b5.chunk_size_ = 2000000      # Size of each chunk downloaded from the network in one step

res_load_text_features = _b5.load_text_features(
    force_reload = True,       # Force re-download of the file
    out = True,                # Display
    runtime = True,            # Runtime calculation
    run = True                 # Run blocking
)

[2023-12-03 00:29:57] Loading a dictionary with hand-crafted features …

[2023-12-03 00:30:00] Loading the “LIWC2007.txt” file 100.0% …

— Runtime: 3.073 sec. —
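
The `chunk_size_` setting limits how much data is fetched per step during a download. The following is a minimal sketch of that idea, not OCEANAI's actual downloader. The helper `copy_in_chunks` is hypothetical, the unit is assumed to be bytes, and an in-memory buffer stands in for the network response.

```python
import io

def copy_in_chunks(source, target, chunk_size=2_000_000):
    """Copy a binary stream to another in fixed-size chunks; return the chunk count."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunks = 0
    while True:
        block = source.read(chunk_size)  # read at most chunk_size bytes
        if not block:                    # an empty read means the stream is exhausted
            break
        target.write(block)
        chunks += 1
    return chunks

# Stand-in for a network response: 5,000,000 bytes held in memory
source = io.BytesIO(b"\0" * 5_000_000)
target = io.BytesIO()
n_chunks = copy_in_chunks(source, target, chunk_size=2_000_000)
print(n_chunks, len(target.getvalue()))
```

A smaller chunk size means more steps and more frequent progress updates; a larger one means fewer reads but more memory held at once.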

Building tokenizer and translation model (RU -> EN)

[5]:
res_setup_translation_model = _b5.setup_translation_model(
    out = True,     # Display
    runtime = True, # Runtime calculation
    run = True      # Run blocking
)

[2023-12-03 00:30:00] Building tokenizer and translation model …

— Runtime: 3.098 sec. —

Building tokenizer and BERT model (for word encoding)

[6]:
# Core setup
_b5.path_to_save_ = './models' # Directory to save the models
_b5.chunk_size_ = 2000000      # Size of each chunk downloaded from the network in one step

res_setup_bert_encoder = _b5.setup_bert_encoder(
    force_reload = True,       # Force re-download of the file
    out = True,                # Display
    runtime = True,            # Runtime calculation
    run = True                 # Run blocking
)

[2023-12-03 00:30:04] Building tokenizer and BERT model …

[2023-12-03 00:30:07] Loading the “bert-base-multilingual-cased.zip” file

[2023-12-03 00:30:07] Unzipping an archive “bert-base-multilingual-cased.zip” …

— Runtime: 14.752 sec. —

Process of extracting text features

Example 1 (Analyzing a video file (EN) with manual transcription)

[7]:
# Video file path
path = '/Users/dl/GitHub/OCEANAI/docs/source/user_guide/notebooks/glgfB3vFewc.004.mp4'

hc_features, nn_features = _b5.get_text_features(
    path = path, # Video file path
    asr = False, # Using a model for ASR
    lang = 'en', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:18] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:19] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 89 ✕ 64
    Dimension of the matrix of deep features: 104 ✕ 768
    Text:
        during those times i feel sad i feel confused and

— Runtime: 0.343 sec. —
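
The call returns two feature matrices whose dimensions are printed in the statistics. The sketch below shows one way to check such dimensions in plain Python, assuming the returned objects behave like 2-D sequences (if they are NumPy arrays, `.shape` gives the same information directly). The helper `matrix_shape` and the zero-filled stand-ins are hypothetical; real values come from `get_text_features`.

```python
def matrix_shape(matrix):
    """Return (rows, columns) for a 2-D sequence whose rows all have equal length."""
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    if any(len(row) != cols for row in matrix):
        raise ValueError("rows have different lengths")
    return rows, cols

# Zero-filled stand-ins with the dimensions reported in the output above
hc_standin = [[0.0] * 64 for _ in range(89)]    # hand-crafted features
nn_standin = [[0.0] * 768 for _ in range(104)]  # deep features

hc_shape = matrix_shape(hc_standin)
nn_shape = matrix_shape(nn_standin)
print(hc_shape, nn_shape)
```

Note that the two matrices differ in both row and column counts, so they cannot be stacked or compared element by element without further processing.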

Example 2 (Analyzing a video file (EN) without manual transcription)

[8]:
# Video file path
path = '/Users/dl/GitHub/OCEANAI/docs/source/user_guide/notebooks/glgfB3vFewc.004.mp4'

hc_features, nn_features = _b5.get_text_features(
    path = path, # Video file path
    asr = True, # Using a model for ASR
    lang = 'en', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:19] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:25] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 89 ✕ 64
    Dimension of the matrix of deep features: 104 ✕ 768
    Text:
        during those times i feel sad i feel confused and- the school and introduce them to our administrators and the different faculty that work throughout the school and the library and the gym and so on and then they can get comfortable if theyre in a new school as well

— Runtime: 6.398 sec. —

Example 3 (Analyzing a video file (RU) without manual transcription)

[9]:
# Video file path
path = '/Users/dl/GitHub/OCEANAI/docs/source/user_guide/notebooks/center_42.mov'

hc_features, nn_features = _b5.get_text_features(
    path = path, # Video file path
    asr = False, # Using a model for ASR
    lang = 'ru', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:25] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:43] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 365 ✕ 64
    Dimension of the matrix of deep features: 414 ✕ 768
    Text:
        на картинке изображены скорее всего друзья которые играют в груз мечом это скорее всего происходит где-то в америке возможно в калифорнии на пляже девушка в топе и в шортах пытается словить мяч также двое парней смотрят одинаково думает как перехватить следующую подачу меча на заднем фоне видны высокие пальмы стоят дома неба голубое песок чистой чётко написки отображаются силой этой людей у парня в дали одеты солнце защитные очки он также в шортах и в майке в близи не видно головы человека он одет в темные шорты и в серую фортболку

— Runtime: 18.045 sec. —

Example 4 (Analyzing a text (RU))

[10]:
# Text
path = '''
На картинке изображены скорее всего друзья, которые играют в игру с мячом.
Это скорее всего происходит где-то в Америке, возможно, в Калифорнии на пляже.
Девушка в топе и в шортах пытается словить мяч. Также двое парней смотрят, один активно думает,
как перехватить следующую подачу мяча. На заднем фоне видны высокие пальмы. Стоят дома.
Небо голубое. Песок чистый. Чётко на песке отображаются силуэты людей. У парня вдали одеты солнцезащитные очки,
он также в шортах и в майке. Вблизи не видно головы человека. Он одет в тёмные шорты и в серую футболку.
'''

hc_features, nn_features = _b5.get_text_features(
    path = path, # Text
    asr = False, # Using a model for ASR
    lang = 'ru', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:43] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:52] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 365 ✕ 64
    Dimension of the matrix of deep features: 414 ✕ 768
    Text:
        на картинке изображены скорее всего друзья которые играют в игру с мячом это скорее всего происходит где-то в америке возможно в калифорнии на пляже девушка в топе и в шортах пытается словить мяч также двое парней смотрят один активно думает как перехватить следующую подачу мяча на заднем фоне видны высокие пальмы стоят дома небо голубое песок чистый чётко на песке отображаются силуэты людей у парня вдали одеты солнцезащитные очки он также в шортах и в майке вблизи не видно головы человека он одет в тёмные шорты и в серую футболку

— Runtime: 9.227 sec. —
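
Comparing the input with the displayed text shows that the text is lowercased, punctuation is dropped, and in-word hyphens (as in "где-то") are kept. The sketch below reproduces that visible effect. It is an assumption based only on the output above, not the library's actual preprocessing, and `normalize_text` is a hypothetical helper.

```python
import re

def normalize_text(text):
    """Lowercase, drop punctuation except in-word hyphens, and collapse whitespace."""
    # Replace every character that is not a word character, whitespace, or hyphen
    cleaned = re.sub(r"[^\w\s-]", " ", text.lower())
    # Remove hyphens left dangling at token edges, then drop empty tokens
    tokens = (token.strip("-") for token in cleaned.split())
    return " ".join(token for token in tokens if token)

sample = (
    "Это скорее всего происходит где-то в Америке, возможно, в Калифорнии на пляже.\n"
    "Небо голубое. Песок чистый."
)
normalized = normalize_text(sample)
print(normalized)
```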

Example 5 (Analyzing a text (EN))

[11]:
# Text
path = '''
today says they to for that but right now i am just watching super girl a new images be catching up
and some shows a good say you guys
'''

hc_features, nn_features = _b5.get_text_features(
    path = path, # Text
    asr = False, # Using a model for ASR
    lang = 'en', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:52] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:53] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 89 ✕ 64
    Dimension of the matrix of deep features: 104 ✕ 768
    Text:
        today says they to for that but right now i am just watching super girl a new images be catching up and some shows a good say you guys

— Runtime: 0.247 sec. —

Example 6 (Analyzing a text file (EN))

[12]:
# Text file path
path = '/Users/dl/GitHub/OCEANAI/docs/source/user_guide/notebooks/glgfB3vFewc.004.txt'

hc_features, nn_features = _b5.get_text_features(
    path = path, # Text file path
    asr = False, # Using a model for ASR
    lang = 'en', # Language: 'en' for models trained on First Impressions V2, 'ru' for models trained on MuPTA
    show_text = True, # Text display
    out = True,       # Display
    runtime = True,   # Runtime calculation
    run = True        # Run blocking
)

[2023-12-03 00:30:53] Extraction of features (hand-crafted and deep) from a text …

[2023-12-03 00:30:53] Statistics of extracted features from the text:
    Dimension of the matrix of hand-crafted features: 89 ✕ 64
    Dimension of the matrix of deep features: 104 ✕ 768
    Text:
        during those times i feel sad i feel confused and

— Runtime: 0.204 sec. —
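
Across the examples, the same `path` argument carries a video file path, a text file path, or the text itself. The sketch below shows one way a caller might tell these cases apart before passing a value on. It is illustrative only and does not describe how OCEANAI decides internally. The helper `classify_input`, the extension sets, and the file names are all hypothetical.

```python
import os

# Extensions seen in the examples above; assumed for illustration, not exhaustive
VIDEO_EXTENSIONS = {".mp4", ".mov"}
TEXT_EXTENSIONS = {".txt"}

def classify_input(value):
    """Guess whether a string is a video path, a text file path, or raw text."""
    stripped = value.strip()
    if "\n" not in stripped:  # multi-line values are treated as raw text
        ext = os.path.splitext(stripped)[1].lower()
        if ext in VIDEO_EXTENSIONS:
            return "video"
        if ext in TEXT_EXTENSIONS:
            return "text_file"
    return "raw_text"

kinds = [
    classify_input("clip.mp4"),
    classify_input("transcript.txt"),
    classify_input("right now i am just watching super girl"),
]
print(kinds)
```

An extension check like this is a heuristic: a single-line sentence that happens to end in ".txt" would be misclassified, so checking that the file exists is a sensible extra step in real use.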