4. Formatting transcription process

If you are completing the transcription training process, ignore steps 1-4, since the folder will contain the samples and templates that you will work from. 

  1. Go to Connected Speech Data https://utexas.box.com/s/uz6206lel0544c3auou0nsw57egcv5q6
  2. Choose either Therapy Trial or Observational depending on the sample type
  3. Choose Spanish, Catalan, or English, depending on the sample language that you are formatting
  4. Choose clipped audio of tasks
  5. Choose the task that you are formatting. For example, if it's the WAB picnic description, choose S3_PicnicScene_Picture Description. 
  6. Go to "Taskname_formatted_for_clan." For example, if it's the WAB picnic description, choose 2. PicnicScene_formatted_for_clan
  7. Once in that folder, you should see the template named according to the task. For example: CODE001_BACC001_PicnicScene_Spa_Timepoint_YYYYMMDD
  8. This template helps you name the file and contains the headers that you will need for the transcription.
  9. Copy and paste the Whisper transcription into this .cha file


      1. You will need to fill out some fields in the headers of the file (headers are the first few lines in the document starting with @)
    1. Preserve the format of the headers
      1. There should be one tab after the colon in each header, so if it gets deleted, put a single tab back in
      2. There are a specific number of "pipes" (this symbol: |) in the @ID header, so don't delete any
      3. Don't add spaces in any of the fields where there are none in the template
      4. Fill out the fields in the headers:
        1. Language
          1. This should be included in the template according to the language you previously chose. If not, enter the language of the sample (in lowercase):
            1. spa (Spanish)
            2. cat (Catalan)
            3. eng (English)
        2. Participant Code: Enter the participant's code here, e.g., BISE016. You will also need to enter the correct BACC### (not applicable for local participants; for a local participant, enter only the participant code). You can find this information in the Connected Speech Data Analysis Smartsheet: https://app.smartsheet.com/sheets/q86RQPjgG33Q7c3PjxqG4P64c37JrJ3gjcfr5g71
        3. Timepoint: Enter LRT if the participant's code begins with BILP, or VISTA if it begins with BISE, followed by the timepoint (e.g., pre, mid, post, etc.). The options are:

          LRT_Pre     VISTA_Pre     Obs_1
          LRT_Mid     VISTA_Mid     Obs_2
          LRT_Post    VISTA_Post
          LRT_6m      VISTA_6m
          LRT_12m     VISTA_12m
        4. Time Duration: Enter the start and end time of the sample
          1. If you used a timer, enter 00:00:00 for the start
          2. Preserve the format of the time as indicated
          3. Note the timecode of the video at the onset of the first word a participant says after the clinician prompt (excluding any words you are omitting from the beginning, as discussed above)
          4. Note the timecode at the offset of the last word the participant says on the script topic/picture description
          5. If the clinician redirects or re-prompts during the probe, omit the duration of this from the total duration
      5. Name of script/sample
        1. After the "comment" header, type the name of the script topic, or type the title of the discourse sample, e.g., "PicnicScene" for the WAB picnic description
            1. @Comment: PicnicScene
            2. @Comment: CatRescue
            3. @Comment: ImportantEvent
    2. After the Comment header, the transcription starts. Each utterance must begin with *PAR: followed by a single TAB (the wider spacing, not a regular space) and then the text. Do not leave a space between *PAR: and the text.
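
The formatting rules above (one tab after the colon in each header, one tab after *PAR:, and an unchanged number of pipes in the @ID header) can be checked automatically. The following is only a minimal sketch, not part of the lab's official workflow: the function names check_line and same_pipe_count are hypothetical, and the pipe check compares against the template line rather than assuming a fixed count.

```python
def check_line(line: str) -> list[str]:
    """Return a list of formatting problems found in one line of a .cha file."""
    problems = []
    if line.startswith("@") and ":" in line:
        # Header with a value, e.g. "@Comment:<TAB>PicnicScene"
        _, rest = line.split(":", 1)
        if not rest.startswith("\t") or rest.startswith("\t\t"):
            problems.append("header needs exactly one tab after the colon")
    elif line.startswith("*PAR:"):
        rest = line[len("*PAR:"):]
        # Exactly one tab, then the text (no extra tab or space)
        if not rest.startswith("\t") or rest[1:2] in ("\t", " "):
            problems.append("*PAR: must be followed by one tab, then the text")
    return problems


def same_pipe_count(template_id_line: str, edited_id_line: str) -> bool:
    """True if the edited @ID header has as many pipes (|) as the template's."""
    return template_id_line.count("|") == edited_id_line.count("|")
```

For instance, check_line("*PAR: el perro") reports a problem because a space, not a tab, follows *PAR:, while check_line("*PAR:\tel perro") returns an empty list.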


    1. Filename Format
      1. The transcription file should be saved in the following format: CODE###_BACC###_TaskName_Language_Timepoint_Date
        1. For the Language, use Spa for Spanish, Cat for Catalan, or Eng for English. (Pending: revise Spanish transcriptions and change naming from "Span" to "Spa".)
        2. For the timepoint, it should be: Pre, Mid, Post, 6m, 12m followed by the number of the probe
        3. You will find the date in the date of administration column in the Connected Speech Data Analysis Smartsheet. It is very important that you carefully follow the format of YYYYMMDD. https://app.smartsheet.com/sheets/q86RQPjgG33Q7c3PjxqG4P64c37JrJ3gjcfr5g71
        4. NameofScriptORDiscourseTask
          1. NameofScript: For VISTA, for example, the second script probe of the script "My Hobbies" during post-treatment for SE001 would be:
            1. CODE###_BACC###_NameOfScript_Language_Timepoint_YYYYMMDD
            2. For example: BISE018_BACC001_MyHobbies_Spa_Post_20240525
          2. NameofDiscourseTask should be:
            1. CODE###_BACC###_Taskname_Language_Timepoint_YYYYMMDD
            2. BILP022_BACC002_PicnicScene_Cat_Pre_20230517.cha
            3. BISE011_BACC004_CatRescue_Spa_Obs1_20230925.cha
        5. !! Don't include spaces in the filename
        6. For local participants (not Barcelona), follow this file naming format: CODE001_Taskname_Language_Timepoint_Date, where Language is Spa or Eng
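
The naming rules above can be sketched as a small helper that assembles a filename and rejects the most common mistakes (a wrong language code, a date that is not YYYYMMDD, or spaces). This is only an illustration under stated assumptions: the function name build_filename is hypothetical, the .cha extension is taken from the examples above, and the BACC### part is left out for local participants.

```python
import re
from datetime import datetime

LANGUAGES = {"Spa", "Cat", "Eng"}


def build_filename(code, task, language, timepoint, date, bacc=None):
    """Build CODE###_BACC###_TaskName_Language_Timepoint_YYYYMMDD.cha.

    Pass bacc=None for local participants, who have no BACC### number.
    """
    if language not in LANGUAGES:
        raise ValueError("language must be Spa, Cat, or Eng")
    # The date must be exactly eight digits and a real calendar date
    if not re.fullmatch(r"\d{8}", date):
        raise ValueError("date must follow YYYYMMDD")
    datetime.strptime(date, "%Y%m%d")  # raises ValueError for e.g. 20240230
    parts = [code] + ([bacc] if bacc else []) + [task, language, timepoint, date]
    name = "_".join(parts) + ".cha"
    if re.search(r"\s", name):
        raise ValueError("filename must not contain spaces")
    return name


print(build_filename("BISE018", "MyHobbies", "Spa", "Post", "20240525", bacc="BACC001"))
# prints BISE018_BACC001_MyHobbies_Spa_Post_20240525.cha
```

A six-digit date such as 230517 is rejected by this sketch, which makes it easier to catch a missing century before the file is saved.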


Please let us know when this is complete, and update the Connected Speech Data Analysis Smartsheet.