Hi Sarah, thank you for your advice. Fortunately, the researcher only asked me to suggest software that can handle the job, so I don't have to get my head around ASySD myself. I think she uses R packages for statistics and probably won't have much trouble learning ASySD. Whether we have the computing power at our agency is another question; if she has to break the references into batches of 50,000 or fewer, that will be fine. I am going to send her your message verbatim.
I visited the University at Albany this past weekend with my son for admitted students day. I learned that the SUNY Albany library and information science program is housed in the Department of Informatics. I've been out of library school for a long time and work at a cloistered two-librarian library; I had never heard the term informatics before, but I now see that many universities have such programs. Librarians like you who are comfortable using R packages are models for the future of the profession, possessing both traditional librarian skills and broader informatics abilities. SUNY Albany is impressive as well. That's a place with plenty of computing power.
Thanks, Sarah!
------------------------------
Scott Hertzberg
Librarian
------------------------------
Original Message:
Sent: Apr 07, 2025 12:53 PM
From: Sarah Young
Subject: Is there a de-duplicator that can handle approximately 600,000 records?
Hi Scott,
This sounds like something that might be best run on a local machine if you have enough computing power, though that would probably take a long time. I wonder if you can do it in batches? Do you expect a lot of duplicates?
In terms of a local option, I'm thinking of R packages like ASySD. You can see how to install it and the relatively simple script to run at the bottom of this page. I think 600,000 records would take a very long time, though, so maybe you can split it into 50,000-record batches, then combine those deduplicated files and re-deduplicate, if that makes sense?
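If it helps, here is a rough sketch of that batch-and-recombine workflow in R. I'm going from memory of the ASySD documentation, so double-check the function names (load_search(), dedup_citations(), write_citations()) against the page linked above; the input file name, output file name, and batch size are just placeholders:

    # Install ASySD from GitHub (it isn't on CRAN)
    # install.packages("devtools")
    # devtools::install_github("camaradesuk/ASySD")

    library(ASySD)

    # Load the combined RIS export (file name is a placeholder)
    citations <- load_search("all_databases.ris", method = "ris")

    # Split the records into batches of at most 50,000 rows
    batch_size <- 50000
    batches <- split(citations, ceiling(seq_len(nrow(citations)) / batch_size))

    # Deduplicate each batch, keeping the automatically resolved unique records
    deduped <- lapply(batches, function(b) dedup_citations(b)$unique)

    # Stitch the batches back together, then deduplicate once more to catch
    # duplicates that ended up in different batches
    combined <- do.call(rbind, deduped)
    final <- dedup_citations(combined)$unique

    # Export the deduplicated set back to RIS
    write_citations(final, type = "ris", filename = "deduplicated.ris")

The second dedup_citations() pass at the end is the important part: without it, any duplicates that were split across batch boundaries would survive.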
Good luck...that's a ton of records!
Sarah
Original Message:
Sent: Apr 03, 2025 4:39 PM
From: Scott Hertzberg
Subject: Is there a de-duplicator that can handle approximately 600,000 records?
Hello Fellow Librarians,
Does anyone know of a de-duplicator that can handle a RIS file with that many records? TERA (the SR-Accelerator) and EndNote can't. It is not for a research synthesis project but for a bibliometric study of the forensic science literature. The researcher who asked me has compiled a giant RIS file sourced from multiple databases. I told her that she would likely have to break the file up into manageable sections, perhaps a lot of them. Thanks for any suggestions.
Scott Hertzberg
National Criminal Justice Reference Service