DarkBERT
Researchers from South Korea have taken the unusual step of creating and training an artificial intelligence (AI) model on data from the dark web, with the aim of shedding light on how to prevent cybercrime.

DarkBERT – The New AI Model

A portion of the internet known as the “dark web” is hidden from regular web browsers: its pages are not indexed by search engines and can only be reached through specialized software such as Tor.

Because this area of the internet is largely untracked, it is known for its anonymous websites and is widely used to host markets for illegal operations such as the trade in drugs and weapons and the sale of stolen data, as well as serving as a haven for hackers to facilitate cybercrime.

Researchers from the Korea Advanced Institute of Science and Technology (KAIST), in conjunction with the data intelligence company S2W, have released DarkBERT, a language model trained exclusively on datasets derived from the dark web.

The trained model can then be applied to analyze the content it encounters on the dark web, informing efforts to better deal with cybercrime in this part of the internet.

While it has yet to be peer-reviewed, the researchers published a paper titled “DarkBERT: A Language Model for the Dark Side of the Internet,” which describes in detail the development and experiments behind this large language model (LLM).

To create a dataset for the model, the research team compiled a sizable database by crawling the Tor network, the anonymizing network through which most dark web sites are reached, so that DarkBERT could adapt to the language used on the dark web.
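The paper does not publish the crawler itself, but the basic mechanics of fetching a hidden-service page are straightforward. The Python sketch below routes an HTTP request through a locally running Tor daemon’s SOCKS proxy; the onion address shown is a hypothetical placeholder, not a real site.

```python
import requests

# Tor's default SOCKS5 proxy. The "socks5h" scheme resolves .onion
# hostnames through Tor itself rather than locally. Requires a
# running Tor daemon and PySocks (pip install requests[socks]).
TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion_page(url: str, timeout: int = 60) -> str:
    """Fetch a single hidden-service page through the Tor proxy."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text

# Hypothetical onion address, purely for illustration:
# html = fetch_onion_page("http://exampleonionservicexyz.onion/")
```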

The database then underwent deduplication, filtering, and pre-processing to address ethical concerns about the sensitive content that fills the dark web; this step removed items such as organizations’ names, details of data leaks, threatening comments, and illicit photos.
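The article does not spell out the exact filtering rules, so the following is only a simplified sketch of that kind of pipeline: hash-based deduplication plus regex masking, with emails and URLs standing in for the sensitive identifiers the team removed.

```python
import hashlib
import re

# Illustrative masking rules only: emails and URLs stand in for the
# sensitive identifiers (victim names, leak details, etc.) that the
# DarkBERT team filtered out with more elaborate processing.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
URL_RE = re.compile(r"https?://\S+|\S+\.onion\S*")

def mask_sensitive(text: str) -> str:
    """Replace identifier-like spans with placeholder tokens."""
    return URL_RE.sub("[URL]", EMAIL_RE.sub("[EMAIL]", text))

def deduplicate(pages: list[str]) -> list[str]:
    """Drop exact duplicates by hashing normalized page text."""
    seen, unique = set(), []
    for page in pages:
        digest = hashlib.sha256(page.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique

# corpus = [mask_sensitive(page) for page in deduplicate(raw_pages)]
```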

While DarkBERT is a new artificial intelligence model, it is built on the RoBERTa architecture, an approach that Facebook researchers introduced in 2019.

RoBERTa is an improvement over Google’s BERT (Bidirectional Encoder Representations from Transformers): after BERT was released as open source, Facebook’s researchers refined its pretraining to boost performance. The research paper that introduces RoBERTa describes it as a “robustly optimized method for pretraining natural language processing (NLP) systems.”
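DarkBERT’s weights and training code are not public, but the general recipe (start from a RoBERTa checkpoint and continue masked-language-model pretraining on a domain corpus) can be sketched with the Hugging Face transformers library. The corpus file name below is a hypothetical placeholder.

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

# Start from the public RoBERTa checkpoint, as DarkBERT did.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM.from_pretrained("roberta-base")

# "darkweb_corpus.txt" is a hypothetical placeholder for the
# preprocessed crawl; the real training data is not public.
dataset = load_dataset("text", data_files={"train": "darkweb_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# RoBERTa's objective: randomly mask tokens and learn to predict them.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True,
                                           mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="darkbert-sketch",
                           num_train_epochs=1),
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```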

AI To Fight Against Cybercrime

In the research paper, the team reports that DarkBERT understands dark web text considerably better than comparable models such as RoBERTa, which was designed to “predict intentionally hidden parts of text within otherwise unmarked language samples.”
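That quoted objective is RoBERTa’s masked-token prediction task. A minimal illustration using the public roberta-base checkpoint (DarkBERT itself is not downloadable) looks like this:

```python
from transformers import pipeline

# Uses the public roberta-base checkpoint purely to show the
# objective; DarkBERT's own weights are not publicly released.
fill_mask = pipeline("fill-mask", model="roberta-base")

# RoBERTa is pretrained to recover the token hidden behind <mask>.
for candidate in fill_mask("The stolen credentials were posted on a <mask> forum."):
    print(f"{candidate['token_str']!r}  score={candidate['score']:.3f}")
```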

The researchers said:

“Our evaluation results show that DarkBERT-based classification model outperforms that of known pre-trained language models.”

They also said that DarkBERT could potentially aid cybersecurity tasks such as identifying websites that sell or publish confidential organizational data leaked by ransomware groups.

It could additionally be used to monitor the many dark web forums that are updated daily, watching for any exchange of illegal information.
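The paper evaluates such tasks by attaching a classification head to the encoder. As a hedged sketch of what a leak-site or forum-post classifier might look like, the snippet below wires a two-label head onto a public RoBERTa checkpoint; the checkpoint and labels are placeholders, not DarkBERT’s actual setup.

```python
import torch
from transformers import RobertaTokenizerFast, RobertaForSequenceClassification

# Placeholder setup: "roberta-base" stands in for a DarkBERT-style
# encoder, and the two labels (0 = benign, 1 = leak site) are
# illustrative. A real deployment would fine-tune this head on
# labeled pages first; fresh from this call, its outputs are random.
tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)
model.eval()

def score_page(text: str) -> float:
    """Return the model's probability that a page is a leak site."""
    inputs = tokenizer(text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# The same scorer could run daily over newly crawled forum posts,
# flagging likely exchanges of leaked data for a human analyst.
print(score_page("Full database dump of ACME Corp customers for sale."))
```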

DarkBERT won’t be accessible to the general public for a while, given the potentially dangerous nature of dark web content. However, requests to use the model for academic research are being accepted.

That doesn’t mean DarkBERT is finished: as with other LLMs, additional training and fine-tuning may still improve its performance. What can be learned from it, and how it will be applied, remains to be seen.
