Summary
Some rows in some tables from the 2023 data collection are duplicated.
Discovered: 2024-02
Timeline to repair: TBD
GitHub issue: TBD
Log: 2025-01-remove-duplicates
Curation challenge
In February of 2024, we discovered a bug where, if users pressed the "submit" button twice in rapid succession, some data in the FAC database could be duplicated. Investigation determined that this did not impact the internal records of the SF-SAC collection, but it did impact how the data was copied out to our dissemination data tables. The dissemination tables power our search and API, and therefore impact all users of the data.
Example
Some records in the federal_awards and findings tables can be duplicated because of this bug. For example, report 2023-06-GSAFAC-0000020785 encountered this bug. As a result, its award rows were duplicated. Below is a subset of rows from the federal_awards data table demonstrating how a set of awards was listed twice.
report_id | reference | prefix | extension |
---|---|---|---|
2023-06-GSAFAC-0000020785 | AWARD-0001 | 93 | 959 |
2023-06-GSAFAC-0000020785 | AWARD-0002 | 93 | 959 |
2023-06-GSAFAC-0000020785 | AWARD-0003 | 93 | 778 |
2023-06-GSAFAC-0000020785 | AWARD-0001 | 93 | 959 |
2023-06-GSAFAC-0000020785 | AWARD-0002 | 93 | 959 |
2023-06-GSAFAC-0000020785 | AWARD-0003 | 93 | 778 |
Our data validation process would not allow a user to enter the data this way; each AWARD-#### reference can appear only once within a report. This error occurred when the data was copied into our dissemination tables (used by search and the API). Anyone requesting the federal awards for report 2023-06-GSAFAC-0000020785 will get two AWARD-0001 rows, two AWARD-0002 rows, and so on.
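One way to check whether a set of retrieved award rows is affected is to count how often each report and award reference pair occurs. The following is a minimal sketch, not an official FAC tool: the rows are copied from the excerpt above, the dictionary keys `report_id` and `reference` are assumed to match the column names shown there, and `find_duplicate_awards` is a hypothetical helper name.

```python
from collections import Counter

# Rows mirror the federal_awards excerpt above.
# The keys are assumed column names taken from that excerpt.
rows = [
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0001"},
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0002"},
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0003"},
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0001"},
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0002"},
    {"report_id": "2023-06-GSAFAC-0000020785", "reference": "AWARD-0003"},
]


def find_duplicate_awards(rows):
    """Return {(report_id, reference): count} for pairs seen more than once."""
    counts = Counter((row["report_id"], row["reference"]) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}


duplicates = find_duplicate_awards(rows)
for (report_id, reference), n in sorted(duplicates.items()):
    print(f"{report_id} {reference} appears {n} times")
```

An empty result would suggest that the rows for that report were not affected by this bug.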
Consequences
Searches for audits are not impacted, but calculations over the awards may yield incorrect results.
Mitigating the error
- In the federal_awards table, a given AWARD reference should only occur once. Filtering this down to a unique set will yield correct results.
- In the findings table, a given AWARD and REF should only appear once in combination; filtering this combination of identifiers down to a unique set will yield correct results.
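The filtering described above could be sketched as follows for rows that have already been downloaded. This is an illustration under stated assumptions rather than a prescribed method: `unique_rows` is a hypothetical helper, the award rows come from the example table, and the findings rows use placeholder values with assumed column names (`award_reference`, `reference_number`) that should be checked against the actual table. The report ID is included in each key on the assumption that award references such as AWARD-0001 repeat across different reports.

```python
def unique_rows(rows, key_fields):
    """Keep the first row for each distinct combination of key_fields."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(row[field] for field in key_fields)
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


REPORT = "2023-06-GSAFAC-0000020785"

# Award rows from the example table, each listed twice.
awards = [
    {"report_id": REPORT, "reference": "AWARD-0001", "prefix": "93", "extension": "959"},
    {"report_id": REPORT, "reference": "AWARD-0002", "prefix": "93", "extension": "959"},
    {"report_id": REPORT, "reference": "AWARD-0003", "prefix": "93", "extension": "778"},
] * 2

# Placeholder findings rows: column names and values are illustrative only.
findings = [
    {"report_id": REPORT, "award_reference": "AWARD-0001", "reference_number": "REF-A"},
    {"report_id": REPORT, "award_reference": "AWARD-0001", "reference_number": "REF-B"},
] * 2

# federal_awards: one row per report and award reference.
unique_awards = unique_rows(awards, ["report_id", "reference"])

# findings: one row per report, award reference, and finding reference.
unique_findings = unique_rows(
    findings, ["report_id", "award_reference", "reference_number"]
)

print(len(awards), "->", len(unique_awards))
print(len(findings), "->", len(unique_findings))
```

Keeping the first occurrence preserves the original row order, so any calculation run over the filtered rows counts each award or finding once.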
Possible resolution
Remove the records from dissemination that contain this error, and re-disseminate the original record. This will yield correct data in the public-facing record.