Archives in the Digital Age. Abderrazak Mkadmi


      1.2.2.2. Document pre-processing

      After the documents have passed through a scanner, the result is always a file in an image format. The nature of these images depends on the scanned originals and on the subsequent processing. Depending on requirements, the images can be in black and white (or converted to black and white), in shades of gray or in color. Color images can be 8, 16, 24, 30 or 36 bits deep. As the resolution increases, both the clarity and the size of the image increase.
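The relationship between resolution, bit depth and file size can be made concrete with a little arithmetic. The following is a minimal sketch, assuming an uncompressed raster image with no header or padding. The function name and the US Letter page size are illustrative choices, not taken from the text.

```python
def uncompressed_size_bytes(width_in: float, height_in: float,
                            dpi: int, bits_per_pixel: int) -> int:
    """Raw size of a scanned page before any compression is applied."""
    width_px = round(width_in * dpi)      # pixels across the page
    height_px = round(height_in * dpi)    # pixels down the page
    total_bits = width_px * height_px * bits_per_pixel
    return (total_bits + 7) // 8          # round up to whole bytes


if __name__ == "__main__":
    # An 8.5 x 11 inch page, used here only as an example format.
    for dpi in (150, 300, 600):
        for bpp in (1, 8, 24):
            size = uncompressed_size_bytes(8.5, 11, dpi, bpp)
            print(f"{dpi} dpi, {bpp:2d} bits per pixel: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the number of pixels, so the raw size grows with the square of the resolution and in direct proportion to the bit depth.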

       – Compression: This consists of reducing the size of files, which reduces the space used on archiving media and makes the files easier to circulate over networks. Several compression methods exist, depending on the scanning method and the nature of the original documents:
         - CCITT G3/G4 compression: Group 4 compression, also known as “G4” or “Modified Modified READ” (MMR), is a lossless image compression method used in Group 4 facsimile machines, as defined in the ITU-T T.6 fax standard. It is only used for bitonal (black and white) images. Group 4 compression is available in many proprietary image file formats, as well as in standard formats such as TIFF (Tagged Image File Format), CALS (Computer-aided Acquisition and Logistics Support), CIT (Intergraph Raster Type 24) and PDF (Portable Document Format);
         - JBIG (Joint Bi-level Image Experts Group) compression: this is a compression method for bi-level images, in which a single bit expresses the color value of each pixel. The standard can also be used to code grayscale images and color images with a limited number of bits per pixel. JBIG was designed for images sent by facsimile and offers significantly higher compression than Group 3 and Group 4 facsimile coding;
         - JPEG (Joint Photographic Experts Group) compression: this algorithm is used to reduce the size of color images. It allows very high compression ratios, but at the expense of image quality: the compression entails a loss of information;
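To illustrate why bitonal scans compress so well without loss, here is a minimal sketch of plain run-length encoding applied to one row of black-and-white pixels. This is not an implementation of Group 3/4 or JBIG coding, which are considerably more elaborate (Group 4, for example, codes each line relative to the previous one). It only shows the underlying idea that long runs of identical pixels can be stored as counts. The function names are hypothetical.

```python
from itertools import groupby


def rle_encode(row: list[int]) -> list[tuple[int, int]]:
    """Collapse a row of 0/1 pixels into (pixel value, run length) pairs."""
    return [(pixel, len(list(run))) for pixel, run in groupby(row)]


def rle_decode(runs: list[tuple[int, int]]) -> list[int]:
    """Rebuild the original row exactly, which is what makes the scheme lossless."""
    row: list[int] = []
    for pixel, length in runs:
        row.extend([pixel] * length)
    return row


if __name__ == "__main__":
    # A 40-pixel row: white margin, a short black stroke, white margin.
    row = [0] * 20 + [1] * 5 + [0] * 15
    runs = rle_encode(row)
    print(runs)
    print(rle_decode(runs) == row)
```

A page of text is mostly white space, so rows like this one dominate, which is why run-based schemes suit bitonal documents but not photographs.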

       – Optical Character Recognition (OCR): The purpose of OCR is to convert text in image format into computer-readable text by translating the groups of dots in a scanned image into characters, together with their formatting. It is carried out by dedicated systems, themselves called “OCR”. The challenge today is to find, among the many tools of this type, the one that is most efficient and best suited to the application at hand. The selection criterion most often cited is effectiveness, that is, a high recognition rate, with 100% as the target. However, the recognition rate does not depend solely on the recognition engine. It also depends on other factors, such as the physical preparation of the paper document upstream and the way the OCR engine's parameters are set to suit the type of content, taking into account, inter alia, the language, quality and layout of the document.
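The text does not say how the recognition rate is measured. One common convention, assumed here, is character-level accuracy: one minus the edit distance between the OCR output and a manually verified reference, divided by the length of the reference. The sketch below implements that convention with a standard Levenshtein distance. The function names and the sample strings are illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def recognition_rate(reference: str, ocr_output: str) -> float:
    """Character-level accuracy in [0, 1], relative to a verified reference text."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = levenshtein(reference, ocr_output)
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    # "dlgital" simulates a typical confusion between "i" and "l".
    print(f"{recognition_rate('digital', 'dlgital'):.3f}")
```

Measuring the same sample pages with different engines or parameter settings gives a comparable figure for choosing between tools.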

      OCR can be applied within an ERM system in two ways:

      1) Application to whole pages of text, in order to index them in full text with the help of spell checkers.

      2) Application to selected areas within the pages (such as titles), in order to use them as an index.

      Various long-established technologies build on OCR techniques to extract information from these digitized documents and enrich their metadata (category, author, title, date, etc.):

       – Automatic Document Recognition (ADR), which consists of distinguishing one type of document from another according to a few predefined parameters. This makes it possible to sort images electronically;

       – Automatic document reading, which uses artificial intelligence technologies to perform linguistic checks on recognized words and interpret them using text-mining functions, for the purpose of pre-analysis and/or thematic classification of the scanned documents.
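Sorting documents by type from a few predefined parameters, as automatic document recognition does, can be sketched with simple keyword rules applied to the OCR text. This is a deliberately naive illustration, not a description of how any particular ADR product works. The document types, the keywords and the `min_hits` threshold are all hypothetical.

```python
import re

# Hypothetical rules: each document type is described by a few keywords.
RULES: dict[str, set[str]] = {
    "invoice": {"invoice", "total", "vat"},
    "contract": {"contract", "parties", "clause"},
    "letter": {"dear", "sincerely"},
}


def classify(ocr_text: str, rules: dict[str, set[str]] = RULES,
             min_hits: int = 2) -> str:
    """Return the type whose keywords match most often, or "unknown"."""
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    best_type, best_hits = "unknown", 0
    for doc_type, keywords in rules.items():
        hits = len(words & keywords)  # distinct keywords found in the text
        if hits > best_hits:
            best_type, best_hits = doc_type, hits
    return best_type if best_hits >= min_hits else "unknown"


if __name__ == "__main__":
    print(classify("INVOICE no. 12 - Total due incl. VAT"))
    print(classify("Minutes of the meeting"))
```

Requiring at least two keyword hits keeps a single stray word from deciding the type; documents that match no rule stay "unknown" and can be routed to manual sorting.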

      1.2.2.3. Document indexing

      Once the document has been acquired through scanning, exchange and/or production, its content must be described so that it can be found and used easily. This second stage of electronic document management is the most important one for keeping the document and using it later. The description can be made by type (a formal description: author, title, date, etc.), by freely chosen concepts or keywords, or on the basis of a thesaurus in order to harmonize practices. In web documents in HTML format, the description is created through META tags, which allow the creator of these documents to define keywords representative of the content, the subject, the author and so on. There are many metadata-related standards today, such as DC (Dublin Core), RDF (Resource Description Framework), EAD (Encoded Archival Description), EAC (Encoded Archival Context) and LOM (Learning Object Metadata) [MKA 08]. The objective is to make this metadata usable by a large number of search tools.
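As an illustration of description through META tags, the sketch below generates Dublin Core metadata for an HTML page, using the widely used convention of `DC.`-prefixed names declared through a `schema.DC` link. The function name and the sample record are hypothetical. The element names (`title`, `creator`, `subject`, `date`) belong to the Dublin Core element set.

```python
from html import escape


def dublin_core_meta(record: dict[str, str]) -> str:
    """Render a flat metadata record as Dublin Core META tags for an HTML head."""
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    for element, value in record.items():
        name = escape(f"DC.{element}", quote=True)
        content = escape(value, quote=True)  # protects quotes, <, > and &
        lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical record for a scanned document.
    record = {
        "title": "Annual report",
        "creator": "Records Office",
        "subject": "archives; digitization",
        "date": "2008-05-12",
    }
    print(dublin_core_meta(record))
```

Because the element names come from a shared standard, any search tool that understands Dublin Core can read the description without knowing how the document was produced.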

      1.2.2.4. Storage of documents

      1.2.2.4.1. Storage media

      Storage, sometimes called archiving (in the primary sense of the term), supports the conservation of documents over time. To implement an effective storage solution, it is first necessary to carry out a needs analysis covering, in particular, the volume of data, their importance, how often they are consulted, their degree of confidentiality, the level of security required, the length of time they are to be kept and the value of putting them online.

      To meet the different needs of this conservation function, an ERM system uses several storage media, chosen according to the following criteria:

       – criteria relating to the document: types of documents, frequency of consultation, interest in having it online and retention periods;

       – criteria relating to the medium: document access time, storage capacity, cost, rewritability or non-rewritability and secure access.
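The two families of criteria above can be sketched as a small matching step: the document side expresses a requirement, the medium side describes what each medium offers, and the system keeps the media that satisfy the requirement. This is only an illustration of the idea. The class names, field names and all figures are hypothetical placeholders, not characteristics of real media.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Medium:
    """Criteria relating to the medium."""
    name: str
    access_time_s: float   # typical time to reach a document
    capacity_gb: float
    cost_per_gb: float
    rewritable: bool


@dataclass(frozen=True)
class Requirement:
    """Criteria relating to the documents to be stored."""
    volume_gb: float
    max_access_time_s: float      # driven by how often documents are consulted
    must_be_non_rewritable: bool  # e.g. when integrity must be guaranteed


def candidate_media(req: Requirement, catalogue: list[Medium]) -> list[Medium]:
    """Media that satisfy the requirement, cheapest first."""
    suitable = [
        m for m in catalogue
        if m.capacity_gb >= req.volume_gb
        and m.access_time_s <= req.max_access_time_s
        and not (req.must_be_non_rewritable and m.rewritable)
    ]
    return sorted(suitable, key=lambda m: m.cost_per_gb)


if __name__ == "__main__":
    # Placeholder catalogue: the values are invented for the example.
    catalogue = [
        Medium("medium_a", 0.01, 1000, 0.10, True),
        Medium("medium_b", 60, 5000, 0.02, True),
        Medium("medium_c", 5, 100, 0.05, False),
    ]
    req = Requirement(volume_gb=500, max_access_time_s=120, must_be_non_rewritable=False)
    print([m.name for m in candidate_media(req, catalogue)])
```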

      There are several storage media that can be classified into generations:

       – First-generation media are considered to be analog media and have not been used since the late 1990s. They comprise punched cards and punched tape, which originated in the 18th century. Their storage capacity is very small, in the order of a few tens of bytes.

       – Second-generation media are magnetic media with a digital recording mode, except for magnetic tape, which has both analog and digital recording modes. They include magnetic tapes, cassettes, hard disks, cartridges and diskettes. These media have nevertheless withstood technological developments over a long period of time [FLE 17].

       – Third
