Development, validation, and application of CYBER-BERT: using deep learning for large-scale identification and classification of cybersecurity disclosures in SEC filings

Bibliographic details
Year of defense: 2024
Main author: Pinheiro, José Ricardo Monteiro
Advisor: Caldas, Miguel Pinto
Defense committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defending institution: Not provided by the institution
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Keywords in English:
NLP
LLM
Access link: https://hdl.handle.net/10438/35562
Abstract: As cybersecurity events have emerged among the top global risks, it has become imperative for firms to provide transparent and timely information about them. This has led to the emergence of a rich cybersecurity disclosure research stream. However, some gaps persist in the extant literature, including: (a) the use of small samples or short time spans; (b) binary classification (cybersecurity-related versus non-cybersecurity-related), rather than multiple disclosure categories; (c) the use of a dictionary approach instead of machine learning (ML) or superior Large Language Models (LLMs); (d) a scarcity of studies that include timely 8-K filings in addition to annual 10-K filings; and (e) the lack of cybersecurity disclosure studies thoroughly examining boilerplate patterns. This paper describes the development and validation of a deep learning model called CYBER-BERT and, to address these gaps, illustrates its application in locating and categorizing 2.5 million cybersecurity-related phrases contained in all 10-K and 8-K SEC filings over 18 years (2006–2023). As contributions of the study, besides the toolset (CYBER-BERT and a Cybersecurity BI), results from four illustrations of its use showed that (a) firms did not file cybersecurity disclosures as promptly as intended by the SEC: 95.5% of disclosures were filed via 10-Ks rather than the more timely 8-Ks; (b) cybersecurity disclosures have increased significantly: total disclosures grew 343% and breach disclosures grew 510%; (c) content-wise, two cybersecurity categories exhibited high (vulnerability) and medium (action) boilerplate use, and all categories had low readability on two independent measures; and (d) following a breach disclosure, vulnerability and action disclosure activities increased 97.4% and 77.4%, respectively, compared with the previous year. Implications for research and practice are also discussed.