Hi Sarah, thank you for your advice. Fortunately, the researcher only asked me to suggest software that can handle the job, so I don't have to get my head around ASySD myself. I think she uses R packages for statistics and probably won't have much trouble learning ASySD. Whether we have the computing power at our agency is another question; if she has to break the references into batches of 50,000 or fewer, that will be fine. I am going to send her your message verbatim.
I visited the University at Albany this past weekend with my son for admitted students day. I learned that the SUNY Albany library and information science program is housed in the Department of Informatics. I've been out of library school for a long time and work at a cloistered two-librarian library; I had never heard the term informatics before, but I now see that many universities have such programs. Librarians like you who are comfortable using R packages are models for the future of the profession, possessing both traditional librarian skills and broader informatics abilities. SUNY Albany is impressive as well. That's a place with plenty of computing power.
Thanks, Sarah!
------------------------------
Scott Hertzberg
Librarian
------------------------------
Original Message:
Sent: Apr 07, 2025 12:53 PM
From: Sarah Young
Subject: Is there a de-duplicator that can handle approximately 600,000 records?
Hi Scott,
This sounds like something that might be best run on a local machine if you have enough computing power, though that would probably take a long time. I wonder if you can do it in batches? Do you expect a lot of duplicates?
In terms of a local option, I'm thinking of R packages like ASySD. You can see how to install it and the relatively simple script to run at the bottom of this page. I think 600,000 records would take a very long time, though, so maybe you can split it into 50,000-record batches, then combine those deduplicated files and re-deduplicate, if that makes sense?
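If it helps, here is a rough sketch of that batch-and-recombine workflow in R. I'm going from memory of the ASySD documentation, so double-check the function names (load_search(), dedup_citations(), write_citations()) against the page linked above; the input file name, output file name, and batch size are just placeholders:

    # Install ASySD from GitHub (it isn't on CRAN)
    # install.packages("devtools")
    # devtools::install_github("camaradesuk/ASySD")

    library(ASySD)

    # Load the combined RIS export (file name is a placeholder)
    citations <- load_search("all_databases.ris", method = "ris")

    # Split the records into batches of at most 50,000 rows
    batch_size <- 50000
    batches <- split(citations, ceiling(seq_len(nrow(citations)) / batch_size))

    # Deduplicate each batch, keeping the automatically resolved unique records
    deduped <- lapply(batches, function(b) dedup_citations(b)$unique)

    # Stitch the batches back together, then deduplicate once more to catch
    # duplicates that ended up in different batches
    combined <- do.call(rbind, deduped)
    final <- dedup_citations(combined)$unique

    # Export the deduplicated set back to RIS
    write_citations(final, type = "ris", filename = "deduplicated.ris")

The second dedup_citations() pass at the end is the important part: without it, any duplicates that were split across batch boundaries would survive.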
Good luck...that's a ton of records!
Sarah
Original Message:
Sent: Apr 03, 2025 4:39 PM
From: Scott Hertzberg
Subject: Is there a de-duplicator that can handle approximately 600,000 records?
Hello Fellow Librarians,
Does anyone know of a de-duplicator that can handle a RIS file with that many records? TERA (the SR-Accelerator) and EndNote can't. It is not for a research synthesis project but for a bibliometric study of the forensic science literature. The researcher who asked me has compiled a giant RIS file sourced from multiple databases. I told her that she would likely have to break the file up into manageable sections, perhaps a lot of them. Thanks for any suggestions.
Scott Hertzberg
National Criminal Justice Reference Service