Separates author information in references files from references_read
Source: R/authors_clean.R
authors_clean.Rd
authors_clean
This function takes the output from references_read and cleans the author information.
Details
Information on addresses, emails, ORCIDs, etc. is matched.
The function then attempts to group entries from the same author into likely author groups based on shared full names, addresses, emails, ORCIDs, etc.
For records that are not matched this way, a Jaro-Winkler similarity metric is calculated for all possible matching author names.
This metric scores how similar two names are from the characters they share, how far apart those characters sit in each name, and how long a prefix the names have in common.
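The Jaro-Winkler score described above can be illustrated with a minimal base R sketch. This is not necessarily the code authors_clean runs internally; the helper names jaro and jaro_winkler, and the prefix weight p = 0.1 (the value conventionally used for this metric), are assumptions made for illustration.

```r
# Jaro similarity between two strings, using base R only.
# NOTE: `jaro` and `jaro_winkler` are illustrative helpers, not refsplitr functions.
jaro <- function(a, b) {
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  n1 <- length(s1)
  n2 <- length(s2)
  if (n1 == 0 && n2 == 0) return(1)
  if (n1 == 0 || n2 == 0) return(0)

  # Two characters match only if they are equal and lie within this window.
  window <- max(0, floor(max(n1, n2) / 2) - 1)
  matched1 <- logical(n1)
  matched2 <- logical(n2)

  for (i in seq_len(n1)) {
    lo <- max(1, i - window)
    hi <- min(n2, i + window)
    if (lo > hi) next
    for (j in lo:hi) {
      if (!matched2[j] && s1[i] == s2[j]) {
        matched1[i] <- TRUE
        matched2[j] <- TRUE
        break
      }
    }
  }

  m <- sum(matched1)
  if (m == 0) return(0)

  # Transpositions: matched characters that appear in a different order.
  t <- sum(s1[matched1] != s2[matched2]) / 2
  (m / n1 + m / n2 + (m - t) / m) / 3
}

# Jaro-Winkler: boost the Jaro score for a shared prefix of up to 4 characters.
jaro_winkler <- function(a, b, p = 0.1) {
  sim <- jaro(a, b)
  s1 <- strsplit(a, "")[[1]]
  s2 <- strsplit(b, "")[[1]]
  k <- min(4, length(s1), length(s2))
  prefix <- 0
  for (i in seq_len(k)) {
    if (s1[i] != s2[i]) break
    prefix <- prefix + 1
  }
  sim + prefix * p * (1 - sim)
}

score <- jaro_winkler("MARTHA", "MARHTA")
```

For the textbook pair "MARTHA" and "MARHTA" this gives a score of about 0.961. Identical strings score 1 and strings with no characters in common score 0, so higher values indicate names that are more likely to belong to the same author.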
Examples
## Load the refsplitr sample dataset "BITR"
data(BITR)
BITR_clean <- authors_clean(BITR)
#>
#> Splitting author records
#>
#> Splitting addresses
#>
#> Matching authors
#>
#> Pruning groupings...
## The output of authors_clean is a list with two elements,
## which can be assigned to dataframes.
BITR_review_df <- BITR_clean$review
BITR_prelim_df <- BITR_clean$prelim
## Users can save these dataframes outside of R as .csv files.
## The "review_df.csv" is then used to review the groupID or authorID
## assignments and make any necessary corrections.
## The function "authors_refine" is used to load and merge the changes
## into R and create a dataframe used for analyses.
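The save-and-review step described in the comments above might look like the following sketch. It uses a small hypothetical data frame as a stand-in for the review table, so the author_name column and all values are invented for illustration. The file is written to a temporary directory rather than a project folder.

```r
# Hypothetical stand-in for BITR_clean$review; the real table has more columns.
review_df <- data.frame(
  authorID = c(1, 2, 3),
  groupID = c(1, 1, 3),
  author_name = c("Smith, J", "Smith, John", "Lee, K"),
  stringsAsFactors = FALSE
)

# Write the table to disk so it can be reviewed outside of R.
review_path <- file.path(tempdir(), "review_df.csv")
write.csv(review_df, review_path, row.names = FALSE)

# ... correct any groupID or authorID assignments in a spreadsheet or text editor ...

# Read the reviewed file back into R.
reviewed_df <- read.csv(review_path, stringsAsFactors = FALSE)
```

Setting row.names = FALSE keeps R's row numbers out of the file, so the columns read back in match the ones that were written.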