Challenge #04: The usage of artificial intelligence-based methods for text data quality improvement

Dataset quality is essential for developing correct models, yet labelling data requires substantial human resources. Developing a text-based dataset (corpus) is even more complicated, because human labelling of text is strongly affected by individual interpretation. Having multiple experts label the same records becomes impractical for huge datasets. This calls for methods or solutions that help estimate or improve the quality of text-based datasets by indicating suspicious labelling records.

The directions for the challenge ideas, which could be converted into multiple final thesis or individual project topics, are the following:

  1. Outlier and noise point detection in textual data analysis.
  2. Automatic class adjustment for labelled text data.
  3. Research on the efficiency of large language models for multi-label text data quality improvement.
  4. Product review evaluation adjustment based on natural language processing and statistical methods.

For more details on the challenge and topics, contact Pavel Stefanovič, VILNIUS TECH.