[LFG] [Deadline extension] Final call for participation FinTOC shared task @Nodalida'19
Sira Ferradans
sira.ferradans at fortia.fr
Fri Aug 2 08:42:21 UTC 2019
*Call for participation - FinTOC shared task*
⇒ *The Second Financial Narrative Processing Workshop (FNP 2019)*
⇒ *The 22nd Nordic Conference on Computational Linguistics (NoDaLiDa’19)*
*Task*: Predict a Table of Contents (ToC) from financial documents.
Two sub-tasks are proposed:
- Detection of titles
- Prediction of a ToC
Shared task webpage: http://wp.lancs.ac.uk/cfie/shared-task/
Shared task contact: fin.toc.task at gmail.com
*Important dates*
*Submission deadline: Aug. 2, 2019*
Workshop day: September 30, 2019
*More reading* 👇
*“Financial Document Structure Extraction”*
*Introduction:*
A vast number of financial documents are created and published constantly
in machine-readable formats (generally PDF), with only minimal structure
information. Firms use such documents, typically corporate annual reports
containing detailed financial and operational information, to report their
activities, financial situation or potential investment plans to
shareholders, investors and the financial markets.
In some countries, such as the US or France, regulators such as the SEC
(EDGAR) or the AMF require firms to follow a certain template when
reporting their financial results, to ensure standardisation and
consistency across firms’ disclosures. In other European countries, on the
other hand, management usually has more discretion over what, where and how
to report, resulting in a lack of standardisation between financial
documents published within the same market.
In this shared task, we focus on analysing financial prospectuses: official
PDF documents in which investment funds precisely describe their
characteristics and investment modalities. Although the content they must
include is often regulated, their format is not standardized and displays a
great deal of variability, ranging from plain text to more graphical and
tabular presentation of data and information. The majority of prospectuses
are published without a table of contents (TOC), which is usually needed to
help readers navigate the document by following a simple outline of headers
and page numbers, and to assist legal teams in checking that all required
content is included. Thus, automatic analysis of prospectuses to extract
their structure is becoming increasingly vital to many firms across the
world.
------------------------------
*Task:*
As part of the Financial Narrative Processing Workshop, we present a shared
task on Financial Document Structure Extraction.
Systems participating in this shared task will be given a sample collection
of financial prospectuses with different levels of structure and different
lengths (document sizes), which are to be automatically analyzed to extract
structural information and build a table of contents.
The task consists of two sub-tasks:
a) Title detection
This is a binary classification task aiming at detecting titles in
financial prospectuses. Given a set of text blocks, the goal is to classify
each given text block as a ‘title’ or ‘non-title’. As shown in Figure 1,
titles can have different layouts (marked with red and green boxes) and
have to be distinguished from regular text (‘non-title’, marked with grey
boxes).
<http://wp.lancs.ac.uk/cfie/files/2018/10/sharedTask-1.png>
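To make the title detection sub-task concrete, here is a minimal rule-based
baseline sketch. It is not the organisers' method and does not follow the
official data format: the `TextBlock` fields (`font_size`, `is_bold`), the
numbering pattern, the body font size and the threshold are all assumptions
chosen for illustration. A real system would more likely learn these cues
from the training data.

```python
import re
from dataclasses import dataclass


@dataclass
class TextBlock:
    # Hypothetical block representation; the real shared task format may differ.
    text: str
    font_size: float
    is_bold: bool


# Assumed numbering styles: "1.", "2.1", "IV.", "A)".
NUMBERING = re.compile(r"^(\d+(\.\d+)*\.?|[IVXLC]+\.|[A-Z]\))\s+")


def title_score(block: TextBlock, body_font_size: float = 10.0) -> int:
    """Count how many simple layout and text cues suggest a title."""
    text = block.text.strip()
    score = 0
    if block.is_bold:
        score += 1
    if block.font_size > body_font_size:
        score += 1
    if NUMBERING.match(text):
        score += 1
    # Titles tend to be short and do not end like a sentence.
    if len(text.split()) <= 12 and not text.endswith("."):
        score += 1
    return score


def classify(block: TextBlock, body_font_size: float = 10.0,
             threshold: int = 2) -> str:
    """Binary decision: 'title' if enough cues fire, else 'non-title'."""
    if title_score(block, body_font_size) >= threshold:
        return "title"
    return "non-title"


if __name__ == "__main__":
    blocks = [
        TextBlock("2.1 Fees and Expenses", 12.0, True),
        TextBlock("The fund invests primarily in European equities "
                  "and may hold cash.", 10.0, False),
    ]
    for b in blocks:
        print(classify(b), "|", b.text)
```

The threshold trades precision against recall: raising it makes the baseline
stricter about what counts as a title.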
b) TOC structure extraction
The TOC is a hierarchical organisation of the headers of a document. In
this subtask, we provide only the headers of a prospectus, and the goal is
to (i) identify the hierarchical level of each header and (ii) organize the
headers of the document according to this hierarchical structure. Note that
two headers with the same layout and the same text can have different
hierarchical levels depending on their location in the document.
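Step (ii) of this sub-task can be sketched as follows, assuming step (i) has
already assigned a level to each header (1 for the top level). The `TocNode`
class and the `build_toc` and `render` helpers are hypothetical names
introduced for illustration and are not part of the shared task's data
format or evaluation script. The sample headers are invented, and they show
the same header text ("Risk Factors") appearing at two different levels.

```python
from dataclasses import dataclass, field


@dataclass
class TocNode:
    title: str
    level: int
    children: list = field(default_factory=list)


def build_toc(headers):
    """Nest (title, level) pairs, given in document order, into a tree.

    Levels are assumed to be integers >= 1, with 1 the top level.
    """
    root = TocNode("ROOT", 0)
    stack = [root]  # path from the root to the most recent header
    for title, level in headers:
        node = TocNode(title, level)
        # Pop until the top of the stack is a strictly shallower header.
        while stack[-1].level >= level:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, depth=0):
    """Return the ToC as indented lines, two spaces per nesting depth."""
    lines = []
    for child in node.children:
        lines.append("  " * depth + child.title)
        lines.extend(render(child, depth + 1))
    return lines


if __name__ == "__main__":
    headers = [
        ("General Information", 1),
        ("Risk Factors", 2),
        ("Fund A", 1),
        ("Investment Policy", 2),
        ("Risk Factors", 3),
    ]
    print("\n".join(render(build_toc(headers))))
```

The stack keeps only the current path from the root, so each header attaches
to the nearest preceding header with a smaller level.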
Participants need to register. Once registered, all participating teams
will be provided with a common training dataset, which includes common
pre-processed input and corrected output. A common development set will
also be provided. A blind test data set will be used to evaluate the output
of the participating teams. An evaluation script will be provided to all
the teams. In addition to the PDF version of the documents, we will provide
their XML representation.
------------------------------
*Background:*
Existing work on book and document table of contents (TOC) recognition has
almost all been on small, application-dependent, and domain-specific
datasets. However, the TOCs of documents from different domains differ
significantly in their visual layout and style, making TOC recognition a
challenging problem for a large-scale collection of heterogeneous documents
and books. Compared to regular books (mostly provided in a full-text format
with limited structural information such as pages and paragraphs),
financial documents, containing textual and non-textual content, have a
more sophisticated structure including parts, sections, sub-sections and
sub-sub-sections.
------------------------------
*Data Format and Evaluation:*
The following pdf file describes the data format and evaluation metric used
in the shared task: Data Format Details
<https://docs.google.com/document/d/1gYRS1wvNrm5DT68W-Jn7LpgfomrbP_diqQYAHM5tA0A/edit>
------------------------------
*Important Dates:*
- Aug 2, 2019: continue collecting
- Aug 5, 2019: results publication
- Aug 10, 2019: Deadline for papers
- Aug 17, 2019: reviews and notification of acceptance
- Aug 24, 2019: camera ready version of shared task system papers due
- Sep 30, 2019: Workshop day
------------------------------
*Shared Task Organisers:*
- Dr Sira Ferradans
<https://scholar.google.fr/citations?user=vGXfCl0AAAAJ&hl=fr>, Fortia
Financial Solutions
- Najah-Imane Bentabet
<https://www.linkedin.com/in/najah-imane-bentabet-7182b456/>, Fortia
Financial Solutions
- Dr Mahmoud El-Haj <http://www.lancaster.ac.uk/staff/elhaj>, Lancaster
University
- Rémi Juge <https://www.linkedin.com/in/jugeremi/>, Fortia Financial
Solutions
------------------------------
*Shared Task Contact:*
Questions about FinTOC-2019 shared task can be sent to:
fin.toc.task at gmail.com
--
*Sira FERRADANS* Chief Research Scientist
17, avenue George V. Paris 75008
+33 (0)6 73 77 20 03
sira.ferradans at fortia.fr | www.fortia.fr