Description of problem
TrID signatures are of much lower quality than those from something like PRONOM, largely because of the learning algorithm TrID uses, and partly because TrID's confidence ratings haven't been translated into Wikidata.
An initial analysis shows that a large number of TrID signatures are 10 bytes or shorter, with over 900 of length 3 or less. All sequences are BOF (beginning-of-file) sequences, and there are approximately 1500 duplicates.
Number of signatures of length 10 or below:
10. ########## 240
9. ######### 301
8. ######## 894
7. ####### 280
6. ###### 495
5. ##### 376
4. #### 2024
3. ### 335
2. ## 326
1. # 355
So, we need to take a look at these signatures and understand how bad the problem is, and what can be done about it.
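As a starting point, the tally above could be reproduced with something like the following. This is a minimal sketch, not the method used for the original analysis: it assumes the BOF sequences have already been exported as hex strings, the function names `summarise` and `render` are hypothetical, and "duplicates" is assumed to mean extra copies of a byte sequence beyond its first occurrence. The sample input is made up for illustration.

```python
from collections import Counter


def summarise(signatures, max_len=10):
    """Tally BOF sequence lengths and duplicate sequences.

    `signatures` is assumed to be an iterable of hex strings,
    one per signature (e.g. "4d5a"); this input format is hypothetical.
    """
    sequences = [bytes.fromhex(s) for s in signatures]
    # Count how many sequences there are at each length up to max_len bytes.
    lengths = Counter(len(seq) for seq in sequences if len(seq) <= max_len)
    # Assumption: a duplicate is every copy of a sequence after the first.
    seen = Counter(sequences)
    duplicates = sum(n - 1 for n in seen.values() if n > 1)
    return lengths, duplicates


def render(lengths, max_len=10):
    """Render the tally in the same text-histogram layout as the issue."""
    return "\n".join(
        f"{n}. {'#' * n} {lengths.get(n, 0)}" for n in range(max_len, 0, -1)
    )


if __name__ == "__main__":
    # Illustrative data only, not taken from the TrID set.
    sample = ["4d5a", "4d5a", "504b0304", "ff"]
    lengths, duplicates = summarise(sample)
    print(render(lengths))
    print("duplicates:", duplicates)
```

Counting on the decoded bytes rather than on the hex strings means that differences in case or formatting of the export do not hide duplicates.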
A good place for discussion, a la the new XML discussion: https://www.wikidata.org/w/index.php?title=Talk:Q41799265&action=edit&redlink=1
Analysis spreadsheet: https://docs.google.com/spreadsheets/d/1M_7K8RpHI2UM2OepUWvNRA4u8KbuEA_o_UH_lO6_4eA/edit?usp=sharing
Related to: https://docs.google.com/document/d/1jMXcbFHVtw8mdNNJNZED7DweXhDvOQ5K2SW_ZVfNJi0/