TrID analysis #32

@ross-spencer

Description of problem

TrID signatures are of much lower quality than those in something like PRONOM, largely because of the learning algorithm TrID uses, and partly because TrID's confidence ratings haven't been translated into Wikidata.

An initial analysis shows that a large number of TrID signatures are 10 bytes or fewer in length, with over 900 of them <= 3 bytes. All sequences are BOF (beginning-of-file) sequences, and there are approximately 1500 duplicates.

Number of signatures of length 10 or below:

10. ########## 240
 9. ######### 301
 8. ######## 894
 7. ####### 280
 6. ###### 495
 5. ##### 376
 4. #### 2024
 3. ### 335
 2. ## 326
 1. # 355

So, we need to take a look at these signatures and understand how bad the problem is, and what can be done about it.
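
As a starting point for that, here is a minimal sketch of how the length histogram and the duplicate count could be reproduced. It assumes the signatures are available as a list of hex-encoded byte strings; the function names and the sample data are hypothetical and are not taken from the TrID or Wikidata data itself.

```python
from collections import Counter


def length_histogram(signatures, max_len=10):
    """Count signatures by byte length, keeping only lengths <= max_len.

    `signatures` is assumed to be an iterable of hex strings, e.g. "4D5A".
    Lengths are returned longest first, matching the listing above.
    """
    lengths = Counter(len(bytes.fromhex(s)) for s in signatures)
    return {n: c for n, c in sorted(lengths.items(), reverse=True) if n <= max_len}


def count_duplicates(signatures):
    """Count signatures that repeat a byte sequence already seen."""
    counts = Counter(bytes.fromhex(s) for s in signatures)
    return sum(c - 1 for c in counts.values() if c > 1)


# Illustrative input only, not real TrID data.
sample = ["4D5A", "4D5A", "89504E47", "FFD8FF", "25504446"]

for n, count in length_histogram(sample).items():
    print(f"{n:>2}. {'#' * n} {count}")
print("duplicates:", count_duplicates(sample))
```

Comparing on decoded bytes rather than on the raw strings means that differences in hex casing are not counted as distinct signatures.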

A good place for discussion, a la the new XML discussion: https://www.wikidata.org/w/index.php?title=Talk:Q41799265&action=edit&redlink=1

Analysis spreadsheet: https://docs.google.com/spreadsheets/d/1M_7K8RpHI2UM2OepUWvNRA4u8KbuEA_o_UH_lO6_4eA/edit?usp=sharing

Related to: https://docs.google.com/document/d/1jMXcbFHVtw8mdNNJNZED7DweXhDvOQ5K2SW_ZVfNJi0/

Labels: bug
